# Greenfield Development with Claude Code

> Full text of a ~30-minute live talk by Martin Brian (Senior AI Engineer) at Marvik's "¡IA en vivo!" meetup — Montevideo, 2026-05-28. A practical session on building an app from zero with Claude Code, bookended by a live audience-played showcase game ("Simon Sees") that is regenerated on stage from a written spec.

The deck is a bilingual (English/Spanish) reveal.js presentation at https://meetup2805.martinbrian.com/presentation.html. This file renders the whole talk as plain markdown: every slide's content, the frameworks, the scored-tool tables, and the source attributions. The talk runs in four "rounds" plus a reveal.

Core thesis: **we can't keep up with AI tooling, but we can teach the thought process for choosing it.** Tools are interchangeable; the decision framework is the durable skill. Every framework in the talk is testable against the showcase game, and the game is rebuilt from a spec to prove the loop closes.

Scoring scale used throughout: **A+ · A · B · C · D · F** (A is good, F is fail). Each tool is scored on four gates — Observability (Obs), Cost, Simplicity (Simp), Correctness (Corr) — against a per-project budget. A tool is dropped if it falls below budget on any single gate. No averaging.

---

## Round 1 — The Game

### Simon Sees (the showcase game)

A competition the audience plays live: **The Room** (you) vs. **The Rival** (a pre-recorded run). No phones, no cloud — a webcam watches the room and only the vision model (SAM 3.1) runs live.

Anatomy of one round:

1. **Build-up** — music plays; the condition lands in silence.
2. **Green light** — move, cluster, coordinate (~5s).
3. **Freeze** — one snapshot; the doll is checking.
4. **Score** — SAM 3.1 counts who matched; the meter moves.

### The twist — "Simon says"

- **Simon round** — Host: *"Simon says — show me something red."* Match it and coverage scores **positive**.
- **Feint round** — Host: *"Show me something red"* (no "Simon says"). It's a trap — matchers score **negative**.

Restraint is an action: holding still on a feint scores too.
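The Simon/feint rule can be written down as a small scoring function. The talk does not give the exact formula, so this is a sketch under an assumed one: coverage is the matched share of the people counted, a Simon round scores `+coverage`, and a feint round scores the share who held still minus the share who matched. `Snapshot` and `round_score` are hypothetical names, not taken from the game's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """One freeze-frame: how many people were counted and how many matched."""
    people: int
    matched: int


def round_score(snapshot: Snapshot, simon_says: bool) -> float:
    """Return a score in [-1.0, 1.0] for a single round (assumed formula)."""
    if snapshot.people == 0:
        return 0.0  # empty room: nothing to score
    coverage = snapshot.matched / snapshot.people
    if simon_says:
        # Simon round: matching is rewarded.
        return coverage
    # Feint round: matchers count against the room, restraint counts for it.
    return (1.0 - coverage) - coverage


print(round_score(Snapshot(people=4, matched=1), simon_says=True))   # 0.25
print(round_score(Snapshot(people=4, matched=1), simon_says=False))  # 0.5
```

Under this assumed formula a room that ignores a feint entirely scores the maximum, which is one way to encode "restraint is an action".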

---

## Round 2 — Why: Software is Complex

### The four domains (Cynefin)

Cynefin (kuh-NEV-in) sorts problems into four domains, each with its own approach, plus a fifth state (Disorder) for when you can't yet tell which one applies:

- **Clear** — Sense → Categorize → Respond. Best practice. Known knowns; checklists work (password reset).
- **Complicated** — Sense → Analyze → Respond. Good practice. Known unknowns; several right answers; specialists help.
- **Complex** — Probe → Sense → Respond. Emergent practice. Unknown unknowns; cause and effect clear only in hindsight. **Software + AI lives here.**
- **Chaotic** — Act → Sense → Respond. Novel practice. No cause-effect; stabilize first, ask later (Apollo 13).
- **Disorder** — you don't know which domain you're in.

Takeaway: software development is Complex. Best practices don't exist here — only emergent ones.

### AI is an amplifier

> "AI magnifies the strengths of high-performing organizations **and** the dysfunctions of struggling ones." — Google DORA, framed as "AI as amplifier" by Nathen Harvey.

So where does it land? Three places the amplifier rule actually bites:

- **The system, not the tool** — returns come from the platform, the workflows, the team. Model = multiplier. Org = integer.
- **Code is a liability** — operating cost > build cost. More code without oversight = more verification debt.
- **Local wins ≠ global wins** — without foundations, local productivity drowns in downstream chaos.

---

## Round 3 — How: Rules of Thumb

We can't keep up with tools; we can teach the thinking.

### Pre-mortem

The exercise: *"It's talk day. The demo died publicly. What killed it?"* Surface the Top Ten failure modes before they're real. Examples raised: Wi-Fi flakes during the regen; projector cable dies; SAM re-downloads weights mid-show; the room is too dim and SAM mis-counts; a slide is stale by talk day; the regen produces something that doesn't run; a live model is called on stage and 502s.

### Close the loop

Each failure mode earns an agentic mitigation. Example: "slide goes stale" → a nightly skill re-checks claims against the source repo and opens a PR on drift. You don't wait for the Top Ten — even a "Maybe" earns an agent. The pre-mortem isn't an exercise; it's a backlog.

### LLMs are coherence engines, not truth engines

- **Vibing**: prompt → LLM → app. Looks right; you can't tell *which 10%* is wrong without running it.
- **Rewilded SE**: prompt → LLM → tool → fact. The LLM writes the **tool that retrieves the fact**. The fact is the verdict.

Coherence ≠ truth. The fact lives outside the model.
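A minimal sketch of "the tool that retrieves the fact", using a claim that appears later in this talk: that a tight CLAUDE.md stays under 200 lines. Instead of asking the model whether the file is tight, the tool counts the lines and the count decides. The names `Verdict`, `check_line_budget`, and `check_file` are illustrative, not from the talk.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Verdict:
    claim: str   # what was asserted
    fact: int    # what was measured
    holds: bool  # whether the measurement supports the claim


def check_line_budget(text: str, max_lines: int, label: str) -> Verdict:
    """Measure the line count and compare it with the claimed ceiling."""
    fact = len(text.splitlines())
    return Verdict(
        claim=f"{label} is under {max_lines} lines",
        fact=fact,
        holds=fact < max_lines,
    )


def check_file(path: Path, max_lines: int = 200) -> Verdict:
    """Same check, reading the text from disk."""
    return check_line_budget(path.read_text(encoding="utf-8"), max_lines, path.name)


print(check_line_budget("rule one\nrule two\nrule three", 200, "CLAUDE.md"))
```

The model can write this function, but it cannot argue with the number it returns.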

### "Great instinct." (the verifier)

A composite of real exchanges. Ask an LLM *"should I rewrite our auth in Rust this sprint?"* and it replies *"Great instinct — Rust would eliminate a whole class of bugs in your auth layer. Let me sketch the migration…"*. Ask *"…is that actually a good idea?"* and it reverses: *"Honestly, no. Three open tickets, no Rust expertise, and the bugs aren't in auth."* That's coherence, not judgment. **The verifier has to live outside the conversation** — golden tests, hooks, CI: the seam between the model and the verdict.

### Where do humans fit?

> "Without comprehension, engineering becomes belief." — after Wardley & Girba, *Rewilding Software Engineering*.

The code is the blueprint; the "spec" is closer to a wishlist — the *code* is what makes the decisions. Cautionary tale: **Knight Capital** lost ~$440M in 45 minutes (Aug 1, 2012) when a deploy left dormant code active on 1 of 8 servers.

---

## Claude Code primitives

### The primitives, at a glance

Six primitives. Part 1 (permissions, hooks) is deterministic / mechanical; Part 2 (sub-agents, MCP servers, CLAUDE.md, skills) is probabilistic / model-driven. Slash commands and plugins are *packaging*, not primitives: they bundle the six.

| Primitive | What it is | When it fires | Key point |
|---|---|---|---|
| Permissions | allow / ask / deny rules in `settings.json` | every tool call | Owner: Claude Code, not the model (deterministic) |
| Hooks | shell commands on lifecycle events (PreToolUse, PostToolUse, Stop, SessionStart…) | on the event | deterministic |
| Sub-agents | isolated context, own tools + prompt | when spawned by the main agent | parallel work · context protection · specialized review |
| MCP servers | external tools via Model Context Protocol (stdio · HTTP · SSE) | when the model calls them | live data, APIs · model-driven trigger |
| CLAUDE.md | markdown loaded in full at session start | always-on context | probabilistic — the model *reads* it, doesn't *obey* it |
| Skills | packaged markdown + scripts | on demand when the *description* matches the prompt | recurring procedures |


### Picking a Claude Code primitive (decision tree)

Walk top-down; the first YES wins:

1. Same approval, over and over? → **permissions** (promote to an allow rule).
2. Deterministic auto-fire on a lifecycle event? → **hook**.
3. External system or live data? → **MCP server**.
4. Verbose / parallelizable work to isolate? → **sub-agent**.
5. Applies on every prompt in the project? → **CLAUDE.md** (rules live here too — split with `@imports` to debloat).
6. Anything else recurring → **skill** (the fallback).
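The tree above is an ordered list of yes/no questions, so it can be sketched as a first-match lookup. The `Need` fields are hypothetical names for the five questions; only the order and the outcomes come from the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Need:
    """Answers to the five questions, in the order the tree asks them."""
    repeated_approval: bool = False
    fires_on_lifecycle_event: bool = False
    external_or_live_data: bool = False
    verbose_or_parallel: bool = False
    applies_to_every_prompt: bool = False


# Order matters: the first YES wins.
RULES = [
    ("repeated_approval", "permissions"),
    ("fires_on_lifecycle_event", "hook"),
    ("external_or_live_data", "MCP server"),
    ("verbose_or_parallel", "sub-agent"),
    ("applies_to_every_prompt", "CLAUDE.md"),
]


def pick_primitive(need: Need) -> str:
    for field_name, primitive in RULES:
        if getattr(need, field_name):
            return primitive
    return "skill"  # the fallback for anything else recurring


# Live data that is also verbose: the earlier question decides.
print(pick_primitive(Need(external_or_live_data=True, verbose_or_parallel=True)))  # MCP server
print(pick_primitive(Need()))  # skill
```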

### Or let it pick for you

`/claude-automation-recommender` — the decision tree, run by Claude Code:

1. Reads the repo — stack, scripts, repeated rituals, friction points.
2. Recommends hooks · sub-agents · skills · plugins · MCP — each tied to a need, with the *why*.
3. Still your call — run each suggestion past the four gates before you install it.

The meta-loop: Claude Code sets up Claude Code. Best on a cold-start repo or onboarding — it surfaces insight; you decide what to install. The judgment stays yours.

### Permissions — three tiers

- **allow** — same outcome every time: `Read`, `Grep`, `npm test`.
- **ask** — side effects worth eyeballing: `git push`, `npm publish`.
- **deny** — destructive / unrecoverable: `rm -rf`, `--force`.

Heuristic: **default `ask`; promote after the 3rd "yes"; demote after the 1st regret.** (Project convention, not official docs.) It's the lightest fix on the list — first thing to reach for, last thing to skip.
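The heuristic is a small state machine, sketched here as bookkeeping you might keep for yourself; Claude Code does not do this for you. Two assumptions are mine rather than the talk's: a regret moves a rule down exactly one tier, and it resets the "yes" counter. The `pattern` string is illustrative and is not a settings-file rule format.

```python
from dataclasses import dataclass

PROMOTE_AFTER_YES = 3  # "promote after the 3rd yes"
DEMOTE = {"allow": "ask", "ask": "deny", "deny": "deny"}  # assumed: one tier per regret


@dataclass
class Rule:
    pattern: str        # e.g. "npm test" (illustrative label only)
    tier: str = "ask"   # the heuristic's default tier
    yes_count: int = 0

    def approve(self) -> None:
        """Record one manual 'yes'; the third one promotes ask -> allow."""
        if self.tier != "ask":
            return
        self.yes_count += 1
        if self.yes_count >= PROMOTE_AFTER_YES:
            self.tier = "allow"

    def regret(self) -> None:
        """A single regret demotes the rule and restarts the count."""
        self.tier = DEMOTE[self.tier]
        self.yes_count = 0


rule = Rule("npm test")
for _ in range(3):
    rule.approve()
print(rule.tier)  # allow
rule.regret()
print(rule.tier)  # ask
```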

### AI moved the doors (one-way vs two-way)

- **Type 1 — one-way door**: irreversible. Slow down, gather data, commit. Used to be: most custom code.
- **Type 2 — two-way door**: reversible. Move fast, accept being wrong. Now: anything you can regen from a spec.

AI didn't change *where* the doors are — it changed *how many components live on the Type 2 side*. (Bezos 2016, one-way / two-way doors.)

---

## Round 4 — The Tool Audition (the gates)

A project moves through stages; matching the tool to the stage is the engineering.

### The gates ▸ v0.1

Four gates every candidate tool must pass, each scored A+→F:

1. **Observability & Ownership** — see *inside* it: scannable, auditable, no black box, no unapproved external LLMs. Can't observe = don't own.
2. **Correctness of Output** — is the *result* right? Verifiable, falsifiable — or running on faith?
3. **Cost** — $/run, tokens too. `/fast` is great and pricey; the threshold is per-project.
4. **Simplicity & Maintainability** — will it make sense in 3 months? Can a teammate run it without you?

Every gate is scored on the same axis; drop the tool if it falls below budget on **any single gate** — no averaging. These four are v0.1 for this project; yours may add a 5th (Ethics, Privacy, Latency, Compliance).

### Score the tool

1. **Profile the project** — which stage: throwaway, internal, or public? Then weigh cost sensitivity · precision · latency · blast radius · team familiarity. Set a **budget per gate**.
2. **Score each tool** — Obs / Cost / Simp / Corr on A+·A·B·C·D·F. A is good, F is fail. Can't decide A-or-B? Pick B — the letter forces a verdict.
3. **Below budget on any gate → drop it** — no averaging. A gate fails only when it's below the budget you set for it; a D can pass here and sink you there. Engineering lives in the threshold.
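The three steps can be sketched as a per-gate comparison with no averaging. The `/fast` scores are the ones from the Step 2 table in this talk; the two budgets (`PROTOTYPE`, `INTERNAL`) are hypothetical values chosen only to show the same row passing one project and failing another.

```python
GRADES = ["F", "D", "C", "B", "A", "A+"]  # worst to best
RANK = {grade: i for i, grade in enumerate(GRADES)}
GATES = ("Obs", "Cost", "Simp", "Corr")


def failing_gates(scores: dict[str, str], budget: dict[str, str]) -> list[str]:
    """Return every gate where the score is below the project's budget."""
    return [gate for gate in GATES if RANK[scores[gate]] < RANK[budget[gate]]]


def verdict(scores: dict[str, str], budget: dict[str, str]) -> str:
    """One failing gate is enough to drop the tool; nothing is averaged."""
    return "drop" if failing_gates(scores, budget) else "keep"


FAST = {"Obs": "A", "Cost": "D", "Simp": "A+", "Corr": "A"}

# Hypothetical budgets, not from the talk.
PROTOTYPE = {"Obs": "C", "Cost": "D", "Simp": "B", "Corr": "B"}
INTERNAL = {"Obs": "B", "Cost": "B", "Simp": "B", "Corr": "B"}

print(verdict(FAST, PROTOTYPE), failing_gates(FAST, PROTOTYPE))  # keep []
print(verdict(FAST, INTERNAL), failing_gates(FAST, INTERNAL))    # drop ['Cost']
```

An average would hide the Cost grade behind three strong ones; the per-gate check is what lets a D pass one budget and sink another.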

### Tools, in the order you reach for them

Scores are **circumstantial** — each row is ONE use case; re-score for yours. The same tool can flip from Reject to Buy when the project changes. (Full table, 30+ tools, in `gates-scored-tools.md`.)

**Step 1 · Project setup**

| Tool | Obs | Cost | Simp | Corr | Use case |
|---|---|---|---|---|---|
| `/init` | A+ | A | A+ | A | scan codebase · draft CLAUDE.md · you review the seams |
| CLAUDE.md (tight: <200 lines, conventions only) | A+ | A | A+ | A | project conventions · always-on context |
| CLAUDE.md (bloated: 500+ lines, conflicting rules, big imports) | C | D | C | D | same tool, wrong use — Claude reads it as *context*, not enforcement |

Run `/init` once per repo. Keep CLAUDE.md tight or you'll regress yourself.

**Step 2 · Daily mode**

| Tool | Obs | Cost | Simp | Corr | Use case |
|---|---|---|---|---|---|
| `/fast` (accelerated Opus speed mode) | A | D | A+ | A | personal/hobby + prototype · one-shot prep, cost-gated |
| Plan mode (propose-then-execute) | A+ | C | A+ | A+ | non-trivial change · catches errors before they ship |

Reach for these for individual tasks; each has a sweet spot.

**Step 3 · Guardrails**

| Tool | Obs | Cost | Simp | Corr | Use case |
|---|---|---|---|---|---|
| Permissions | A+ | A+ | A+ | A+ | allow / ask / deny · the lightest fix |
| claude-code-hooks-mastery | A+ | A+ | C | A | surgical: lint, secret-scan, boundary-check |
| Hooks gone wrong | C | D | D | C | same tool, wrong use — over-engineered; every Claude action stalls |

Use hooks for: lint · secret detection · boundary checks · spec-drift · cost audit · test gating. One hook per concern; keep them simple, fast, single-purpose.
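A single-purpose secret-detection hook might look like the sketch below. It rests on assumptions to check against the current hooks documentation: that a PreToolUse hook receives the tool call as JSON on stdin, that written text arrives under `tool_input.content` or `tool_input.new_string`, and that exit code 2 blocks the call with stderr passed back to the model. The two patterns are examples, not a complete scanner.

```python
import json
import re
from typing import TextIO

SECRET_PATTERNS = {
    "AWS access key id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private key block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}

ALLOW = 0
BLOCK = 2  # assumed: exit code 2 tells Claude Code to block the tool call


def find_secrets(text: str) -> list[str]:
    """Return the names of all patterns that match the text."""
    return [name for name, pattern in SECRET_PATTERNS.items() if pattern.search(text)]


def run(stdin: TextIO, stderr: TextIO) -> int:
    """Read one hook payload and return the exit code the hook should use."""
    payload = json.load(stdin)
    tool_input = payload.get("tool_input", {})
    # Assumed field names for the text a write or edit is about to produce.
    text = "\n".join(str(tool_input.get(key, "")) for key in ("content", "new_string"))
    hits = find_secrets(text)
    if hits:
        print("Blocked: possible " + ", ".join(hits), file=stderr)
        return BLOCK
    return ALLOW
```

To use it as a hook script, the entry point would call `sys.exit(run(sys.stdin, sys.stderr))`. It does one thing, reads only what it is handed, and returns quickly, which keeps it in the "surgical" row rather than the "hooks gone wrong" one.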

**Step 4 · External data**

| Tool | Obs | Cost | Simp | Corr | Use case |
|---|---|---|---|---|---|
| Context7 MCP | D | C | A+ | A | prototype + internal · live library docs · vendor before public/regulated |
| Playwright MCP | A | A | A | A+ | UI verification · real browser, no hallucinations |
| Slack MCP | C | A | A | B | internal product+ · ops/on-call · lock scope; send is irreversible |
| Vercel MCP | C | A | A | B | internal product+ · deploys + envs · split read/write configs |
| Gmail / Drive MCP | D | A | A+ | A | personal/hobby only · forbidden for internal product+ (client/business data) |
| Generic vendor-API MCP | D | C | A | C | prototype OK · vendor or replicate before internal product+ · the cautionary archetype |

External system → Observability is the gate to watch. Vendor or replicate before prod.

**Step 5 · Community plugins**

| Tool | Obs | Cost | Simp | Corr | Use case |
|---|---|---|---|---|---|
| claude-mermaid | A+ | A | A+ | A+ | diagrams in any repo |
| revealjs-skill | A+ | A+ | A | A | decks like this one |

Pin both to a commit in `.claude-plugin/marketplace.json`: `ref` = branch/tag (drifts), `sha` = exact commit (frozen). Both supported on `github`, `url`, and `git-subdir` sources.

**Step 6 · Famous frameworks**

| Tool | Obs | Cost | Simp | Corr | Use case |
|---|---|---|---|---|---|
| obra/superpowers | A | D | D | A+ | TDD methodology · pay Cost & Simp to buy A+ Corr |
| pr-review-toolkit | A | C | C | A+ | pre-merge review · same trade as superpowers, lighter |
| wshobson/agents | A | C | C | A | cherry-pick 2–3 · don't install the whole marketplace |
| claude-flow | D | F | D | D | personal/hobby demo only · even Corr is D, so there is nothing to buy · drop for anything above that stage |

Cost & Simp can be *paid* when Correctness is the bottleneck — but failing the gate you're buying is still a no. No averaging.

**Step 7 · Around Claude Code**

| Tool | Obs | Cost | Simp | Corr | Use case |
|---|---|---|---|---|---|
| ccusage | A+ | A+ | A+ | A+ | local token/cost analyzer · the no-brainer · default-on |
| claudia | A | A+ | A | A | internal product+ · desktop dashboard for teams that want a UI alongside the CLI |
| claude-code-router | D | A+ | C | D | personal/hobby only · routes to DeepSeek/Gemini · two failing gates · never for internal product+ (client data) |

ccusage makes Cost enforceable; claudia adds a lens; the router is the cautionary tale — same scores, but the recommendation flips from Reject to Buy on a personal hobby project.

### Tactics ▸ how to raise scores

Tactics read as **deltas**: `++` raises a grade · `+` a smaller gain · `=` unchanged · `−` small cost. Match the move to the failing gate.

| Tactic | Observability | Cost | Simplicity | Correctness |
|---|---|---|---|---|
| Vendoring (pull the code in) | ++ | + | − | = |
| Version locking (pin models, prompts, data) | + | = | = | ++ |
| Audit hooks (cheap-model checks) | ++ | − | = | + |

(Compose your dev experience from many small tools — after Wardley & Girba, *Rewilding SE*.)

### Take the rubric home

`/claude-tool-audit` — a Claude Code plugin that walks you through scoring a candidate tool against the four gates:

- `audit-tool <tool>` — score one candidate
- `audit-project` — audit a whole repo
- `budget-planner` — set per-gate budgets for a new project

29+ worked audits covering models, MCPs, hooks, frameworks, and wrappers — all parseable, all comparable.

---

## Finale — The Reveal

While Rounds 2–4 are presented, a separate Claude Code session regenerates the opening Simon Sees game from a spec. The talk ends by switching to it.

### Would your gates change?

Re-score the same toolkit against a different brief — e.g. "EMP-774 · task compliance monitor." Same person, same HUD aesthetic; the *use case* shifted. Pick your gates, score again, drop the noise.

### Same gates. Different setup wins.

Two example profiles from the audit framework — different budgets, sometimes different gates. Pick yours.

| Budget | The Game (personal/hobby, 5 min, a laugh) | Surveillance (regulated, 24/7, livelihoods) |
|---|---|---|
| Observability | ≥ C · ok | ≥ A+ · required |
| Cost | ≥ D · 5 min/year | ≥ A · 24/7 runtime |
| Simplicity | ≥ A · wins | ≥ D · layered OK |
| Correctness | ≥ C · false positives are funny | ≥ A+ · false positives cost jobs |
| + Ethics (5th gate) | — n/a | ≥ A+ · added |

The toolkit is portable. The judgment isn't.
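The two profiles can be run against rows from the Round 4 tables to show how an added fifth gate changes the outcome. The budgets and tool scores are the ones in this talk. One rule is my assumption: a tool with no score on a gate the profile requires fails that gate until someone scores it.

```python
RANK = {grade: i for i, grade in enumerate(["F", "D", "C", "B", "A", "A+"])}

# Budgets from the table above.
GAME = {"Obs": "C", "Cost": "D", "Simp": "A", "Corr": "C"}
SURVEILLANCE = {"Obs": "A+", "Cost": "A", "Simp": "D", "Corr": "A+", "Ethics": "A+"}

# Scores from the Round 4 tables (four gates only; no Ethics score exists yet).
TOOLS = {
    "Permissions": {"Obs": "A+", "Cost": "A+", "Simp": "A+", "Corr": "A+"},
    "Playwright MCP": {"Obs": "A", "Cost": "A", "Simp": "A", "Corr": "A+"},
}


def failing_gates(scores: dict[str, str], budget: dict[str, str]) -> list[str]:
    """Check every gate the profile lists, including ones added later."""
    failed = []
    for gate, minimum in budget.items():
        grade = scores.get(gate)  # None when the tool was never scored on this gate
        if grade is None or RANK[grade] < RANK[minimum]:
            failed.append(gate)  # assumed: unscored counts as failing
    return failed


for name, scores in TOOLS.items():
    print(name, failing_gates(scores, GAME), failing_gates(scores, SURVEILLANCE))
# Permissions [] ['Ethics']
# Playwright MCP [] ['Obs', 'Ethics']
```

Both tools clear the game's budget untouched; under the regulated profile even the all-A+ row has to be re-scored before it can be kept.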

### Thank you

Thank you — questions ▸ rebuild ▸ play. The bookend is the deliverable: opening game → frameworks justify the spec → spec rebuilds the game.

Speaker: **Martin Brian** — Senior AI Engineer.

---

## Sources & attributions

- Cynefin — Snowden, D. J., & Boone, M. E., *A Leader's Framework for Decision Making*, HBR, Nov 2007. https://hbr.org/2007/11/a-leaders-framework-for-decision-making
- "AI is an amplifier" — Google DORA, *State of AI-assisted Software Development*; framing by Nathen Harvey. https://cloud.google.com/resources/content/dora-roi-of-ai-assisted-software-development
- Pre-mortem — Klein, G., *Performing a Project Premortem*, HBR, Sep 2007. https://hbr.org/2007/09/performing-a-project-premortem (clustering method: Mountain Goat Software / Mike Cohn).
- Coherence-not-truth & "build the tool that retrieves the fact" — Wardley, S. & Girba, T., *Rewilding Software Engineering*. https://medium.com/feenk/rewilding-software-engineering-900ca95ebc8c — and Wąsowski, J., "Stop writing specs, start writing facts." (paraphrased, not quoted)
- Knight Capital — ~$440M in 45 minutes, Aug 1, 2012. https://en.wikipedia.org/wiki/Knight_Capital_Group
- One-way / two-way doors — Bezos, 2016 Amazon shareholder letter. https://www.aboutamazon.com/news/company-news/2016-letter-to-shareholders
- Claude Code primitives — https://code.claude.com/docs (permissions, hooks, mcp, sub-agents, memory, skills, plugins).
- Cross-cutting frameworks — Choose Boring Technology (mcfunley.com/choose-boring-technology), Wardley Maps (learnwardleymapping.com), Google SRE error budgets (sre.google/sre-book/embracing-risk).

Note: the Wardley/Girba one-liners are paraphrases from the *Rewilding Software Engineering* series, not verbatim quotes. Tool star-counts and grades are circumstantial and were last verified 2026-05-28.