# Greenfield Development with Claude Code

> Full text of a ~30-minute live talk by Martin Brian (Senior AI Engineer) at Marvik's "¡IA en vivo!" meetup — Montevideo, 2026-05-28. A practical session on building an app from zero with Claude Code, bookended by a live audience-played showcase game ("Simon Sees") that is regenerated on stage from a written spec. The deck is a bilingual (English/Spanish) reveal.js presentation at https://meetup2805.martinbrian.com/presentation.html.

This file renders the whole talk as plain markdown: every slide's content, the frameworks, the scored-tool tables, and the source attributions.

The talk runs in four "rounds" plus a reveal. Core thesis: **we can't keep up with AI tooling, but we can teach the thought process for choosing it.** Tools are interchangeable; the decision framework is the durable skill. Every framework in the talk is testable against the showcase game, and the game is rebuilt from a spec to prove the loop closes.

Scoring scale used throughout: **A+ · A · B · C · D · F** (A is good, F is fail). Each tool is scored on four gates — Observability (Obs), Cost, Simplicity (Simp), Correctness (Corr) — against a per-project budget. A tool is dropped if it falls below budget on any single gate. No averaging.

---

## Round 1 — The Game

### Simon Sees (the showcase game)

A competition the audience plays live: **The Room** (you) vs. **The Rival** (a pre-recorded run). No phones, no cloud — a webcam watches the room and only the vision model (SAM 3.1) runs live.

Anatomy of one round:

1. **Build-up** — music plays; the condition lands in silence.
2. **Green light** — move, cluster, coordinate (~5s).
3. **Freeze** — one snapshot; the doll is checking.
4. **Score** — SAM 3.1 counts who matched; the meter moves.

### The twist — "Simon says"

- **Simon round** — Host: *"Simon says — show me something red."* Match it and coverage scores **positive**.
- **Feint round** — Host: *"Show me something red"* (no "Simon says").
  It's a trap — matchers score **negative**. Restraint is an action: holding still on a feint scores too.

---

## Round 2 — Why: Software is Complex

### The four domains (Cynefin)

Cynefin (kuh-NEV-in) sorts problems into four domains, each with its own approach:

- **Clear** — Sense → Categorize → Respond. Best practice. Known knowns; checklists work (password reset).
- **Complicated** — Sense → Analyze → Respond. Good practice. Known unknowns; several right answers; specialists help.
- **Complex** — Probe → Sense → Respond. Emergent practice. Unknown unknowns; cause and effect clear only in hindsight. **Software + AI lives here.**
- **Chaotic** — Act → Sense → Respond. Novel practice. No cause-effect; stabilize first, ask later (Apollo 13).
- **Disorder** — not a domain with its own approach: you don't know which domain you're in.

Takeaway: software development is Complex. Best practices don't exist here — only emergent ones.

### AI is an amplifier

> "AI magnifies the strengths of high-performing organizations **and** the dysfunctions of struggling ones." — Google DORA, framed as "AI as amplifier" by Nathen Harvey.

So where does it land? Three places the amplifier rule actually bites:

- **The system, not the tool** — returns come from the platform, the workflows, the team. Model = multiplier. Org = integer.
- **Code is a liability** — operating cost > build cost. More code without oversight = more verification debt.
- **Local wins ≠ global wins** — without foundations, local productivity drowns in downstream chaos.

---

## Round 3 — How: Rules of Thumb

We can't keep up with tools; we can teach the thinking.

### Pre-mortem

The exercise: *"It's talk day. The demo died publicly. What killed it?"* Surface the Top Ten failure modes before they're real.
Examples raised: Wi-Fi flakes during the regen; projector cable dies; SAM re-downloads weights mid-show; the room is too dim and SAM mis-counts; a slide is stale by talk day; the regen produces something that doesn't run; a live model is called on stage and 502s.

### Close the loop

Each failure mode earns an agentic mitigation. Example: "slide goes stale" → a nightly skill re-checks claims against the source repo and opens a PR on drift. You don't wait for the Top Ten — even a "Maybe" earns an agent. The pre-mortem isn't an exercise; it's a backlog.

### LLMs are coherence engines, not truth engines

- **Vibing**: prompt → LLM → app. Looks right; you can't tell *which 10%* is wrong without running it.
- **Rewilded SE**: prompt → LLM → tool → fact. The LLM writes the **tool that retrieves the fact**. The fact is the verdict.

Coherence ≠ truth. The fact lives outside the model.

### "Great instinct." (the verifier)

A composite of real exchanges. Ask an LLM *"should I rewrite our auth in Rust this sprint?"* and it replies *"Great instinct — Rust would eliminate a whole class of bugs in your auth layer. Let me sketch the migration…"*. Ask *"…is that actually a good idea?"* and it reverses: *"Honestly, no. Three open tickets, no Rust expertise, and the bugs aren't in auth."*

That's coherence, not judgment. **The verifier has to live outside the conversation** — golden tests, hooks, CI: the seam between the model and the verdict.

### Where do humans fit?

> "Without comprehension, engineering becomes belief." — after Wardley & Girba, *Rewilding Software Engineering*.

The code is the blueprint; the "spec" is closer to a wishlist — the *code* is what makes the decisions.

Cautionary tale: **Knight Capital** lost ~$440M in 45 minutes (Aug 1, 2012) when a deploy missed 1 of 8 servers and a repurposed flag reactivated the dormant code still running there.

---

## Claude Code primitives

### The primitives, at a glance

Six primitives. Part 1 (permissions, hooks) is deterministic / mechanical; Part 2 (sub-agents, MCP servers, CLAUDE.md, skills) is probabilistic / model-driven.
Slash commands and plugins are *packaging*, not primitives — they bundle the six.

| Primitive | What it is | When it fires | Key point |
|---|---|---|---|
| Permissions | allow / ask / deny rules in `settings.json` | every tool call | Owner: Claude Code, not the model (deterministic) |
| Hooks | shell commands on lifecycle events (PreToolUse, PostToolUse, Stop, SessionStart…) | on the event | deterministic |
| Sub-agents | isolated context, own tools + prompt | when spawned by the main agent | parallel work · context protection · specialized review |
| MCP servers | external tools via Model Context Protocol (stdio · HTTP · SSE) | when the model calls them | live data, APIs · model-driven trigger |
| CLAUDE.md | markdown loaded in full at session start | always-on context | probabilistic — the model *reads* it, doesn't *obey* it |
| Skills | packaged markdown + scripts | on demand when the *description* matches the prompt | recurring procedures |

### Picking a Claude Code primitive (decision tree)

Walk top-down; the first YES wins:

1. Same approval, over and over? → **permissions** (promote to an allow rule).
2. Deterministic auto-fire on a lifecycle event? → **hook**.
3. External system or live data? → **MCP server**.
4. Verbose / parallelizable work to isolate? → **sub-agent**.
5. Applies on every prompt in the project? → **CLAUDE.md** (rules live here too — split with `@imports` to debloat).
6. Anything else recurring → **skill** (the fallback).

### Or let it pick for you

`/claude-automation-recommender` — the decision tree, run by Claude Code:

1. Reads the repo — stack, scripts, repeated rituals, friction points.
2. Recommends hooks · sub-agents · skills · plugins · MCP — each tied to a need, with the *why*.
3. Still your call — run each suggestion past the four gates before you install it.

The meta-loop: Claude Code sets up Claude Code.
Best on a cold-start repo or onboarding — it surfaces insight; you decide what to install. The judgment stays yours.

### Permissions — three tiers

- **allow** — same outcome every time: `Read`, `Grep`, `npm test`.
- **ask** — side effects worth eyeballing: `git push`, `npm publish`.
- **deny** — destructive / unrecoverable: `rm -rf`, `--force`.

Heuristic: **default `ask`; promote after the 3rd "yes"; demote after the 1st regret.** (Project convention, not official docs.) It's the lightest fix on the list — first thing to reach for, last thing to skip.

### AI moved the doors (one-way vs two-way)

- **Type 1 — one-way door**: irreversible. Slow down, gather data, commit. Most custom code used to sit here.
- **Type 2 — two-way door**: reversible. Move fast, accept being wrong. Now this includes anything you can regen from a spec.

AI didn't change *where* the doors are — it changed *how many components live on the Type 2 side*. (Bezos 2016, one-way / two-way doors.)

---

## Round 4 — The Tool Audition (the gates)

A project moves through stages; matching the tool to the stage is the engineering.

### The gates ▸ v0.1

Four gates every candidate tool must pass, each scored A+→F:

1. **Observability & Ownership** — see *inside* it: scannable, auditable, no black box, no unapproved external LLMs. Can't observe = don't own.
2. **Correctness of Output** — is the *result* right? Verifiable, falsifiable — or running on faith?
3. **Cost** — $/run, tokens too. `/fast` is great and pricey; the threshold is per-project.
4. **Simplicity & Maintainability** — will it make sense in 3 months? Can a teammate run it without you?

Every gate is scored on the same axis; drop the tool if it falls below budget on **any single gate** — no averaging. These four are v0.1 for this project; yours may add a 5th (Ethics, Privacy, Latency, Compliance).

### Score the tool

1. **Profile the project** — which stage: throwaway, internal, or public?
   Then weigh cost sensitivity · precision · latency · blast radius · team familiarity. Set a **budget per gate**.
2. **Score each tool** — Obs / Cost / Simp / Corr on A+·A·B·C·D·F. A is good, F is fail. Can't decide A-or-B? Pick B — the letter forces a verdict.
3. **Below budget on any gate → drop it** — no averaging. A gate fails only when it's below the budget you set for it; a D can pass here and sink you there. Engineering lives in the threshold.

### Tools, in the order you reach for them

Scores are **circumstantial** — each row is ONE use case; re-score for yours. The same tool can flip from Reject to Buy when the project changes. (Full table, 30+ tools, in `gates-scored-tools.md`.)

**Step 1 · Project setup**

| Tool | Obs | Cost | Simp | Corr | Use case |
|---|---|---|---|---|---|
| `/init` | A+ | A | A+ | A | scan codebase · draft CLAUDE.md · you review the seams |
| CLAUDE.md (tight: <200 lines, conventions only) | A+ | A | A+ | A | project conventions · always-on context |
| CLAUDE.md (bloated: 500+ lines, conflicting rules, big imports) | C | D | C | D | same tool, wrong use — Claude reads it as *context*, not enforcement |

Run `/init` once per repo. Keep CLAUDE.md tight or you'll regress yourself.

**Step 2 · Daily mode**

| Tool | Obs | Cost | Simp | Corr | Use case |
|---|---|---|---|---|---|
| `/fast` (accelerated Opus speed mode) | A | D | A+ | A | personal/hobby + prototype · one-shot prep, cost-gated |
| Plan mode (propose-then-execute) | A+ | C | A+ | A+ | non-trivial change · catches errors before they ship |

Reach for these for individual tasks; each has a sweet spot.
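The "Score the tool" procedure can be sketched in a few lines. This minimal Python sketch is illustrative and not part of the talk's tooling. It makes three assumptions. First, the A+→F scale maps onto ordinal ranks 5 down to 0. Second, the two Step 2 rows are the input. Third, the project has a hypothetical cost-sensitive budget (Cost ≥ C). `failing_gates`, `verdict` and `mean_rank` are illustrative names.

```python
# Minimal sketch of the "below budget on any single gate -> drop" rule.
# Assumption: grades are ordinal, ranked F (worst) to A+ (best).
GRADES = ["F", "D", "C", "B", "A", "A+"]
RANK = {grade: i for i, grade in enumerate(GRADES)}
GATES = ("obs", "cost", "simp", "corr")


def failing_gates(scores: dict[str, str], budget: dict[str, str]) -> list[str]:
    """Gates where the tool scores below the per-gate budget."""
    return [g for g in GATES if RANK[scores[g]] < RANK[budget[g]]]


def verdict(scores: dict[str, str], budget: dict[str, str]) -> str:
    """Keep only if every gate meets its budget; no averaging across gates."""
    failed = failing_gates(scores, budget)
    return "keep" if not failed else "drop: " + ", ".join(failed)


def mean_rank(scores: dict[str, str]) -> float:
    """The average the rule deliberately ignores, shown only for contrast."""
    return sum(RANK[scores[g]] for g in GATES) / len(GATES)


# Hypothetical budget for a cost-sensitive project (not from the talk).
budget = {"obs": "B", "cost": "C", "simp": "B", "corr": "A"}

# Scores copied from the Step 2 table.
fast = {"obs": "A", "cost": "D", "simp": "A+", "corr": "A"}
plan_mode = {"obs": "A+", "cost": "C", "simp": "A+", "corr": "A+"}

print("/fast:", verdict(fast, budget), "| mean rank", mean_rank(fast))  # /fast: drop: cost | mean rank 3.5
print("plan mode:", verdict(plan_mode, budget))  # plan mode: keep
```

`/fast` averages 3.5, which falls between B and A, so an averaging rubric would have kept it. It is still dropped on Cost alone. Re-scoring against a different profile, such as the Game or Surveillance budgets in the Finale, only changes the `budget` dict.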
**Step 3 · Guardrails**

| Tool | Obs | Cost | Simp | Corr | Use case |
|---|---|---|---|---|---|
| Permissions | A+ | A+ | A+ | A+ | allow / ask / deny · the lightest fix |
| claude-code-hooks-mastery | A+ | A+ | C | A | surgical: lint, secret-scan, boundary-check |
| Hooks gone wrong | C | D | D | C | same tool, wrong use — over-engineered; every Claude action stalls |

Use hooks for: lint · secret detection · boundary checks · spec-drift · cost audit · test gating. One hook per concern; keep them simple, fast, single-purpose.

**Step 4 · External data**

| Tool | Obs | Cost | Simp | Corr | Use case |
|---|---|---|---|---|---|
| Context7 MCP | D | C | A+ | A | prototype + internal · live library docs · vendor before public/regulated |
| Playwright MCP | A | A | A | A+ | UI verification · real browser, no hallucinations |
| Slack MCP | C | A | A | B | internal product+ · ops/on-call · lock scope; send is irreversible |
| Vercel MCP | C | A | A | B | internal product+ · deploys + envs · split read/write configs |
| Gmail / Drive MCP | D | A | A+ | A | personal/hobby only · forbidden for internal product+ (client/business data) |
| Generic vendor-API MCP | D | C | A | C | prototype OK · vendor or replicate before internal product+ · the cautionary archetype |

External system → Observability is the gate to watch. Vendor or replicate before prod.

**Step 5 · Community plugins**

| Tool | Obs | Cost | Simp | Corr | Use case |
|---|---|---|---|---|---|
| claude-mermaid | A+ | A | A+ | A+ | diagrams in any repo |
| revealjs-skill | A+ | A+ | A | A | decks like this one |

Pin both plugins to a commit in `.claude-plugin/marketplace.json`: `ref` = branch/tag (drifts), `sha` = exact commit (frozen). Both fields are supported on `github`, `url`, and `git-subdir` sources.
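The pinning advice above can be checked mechanically, for example as a CI step. This Python sketch is hypothetical and makes several assumptions:

- Each entry in `plugins` carries a `source` object with a `source` type field, plus optional `ref` and `sha` fields.
- String sources are local paths with nothing to pin.

Verify this shape against the current plugin-marketplace documentation before relying on it. The marketplace name, repository names and SHA are placeholders, and `unpinned_plugins` is an illustrative name.

```python
# Hypothetical CI check: list plugins in .claude-plugin/marketplace.json that
# are not frozen to an exact commit. Assumed entry shape (verify against docs):
#   {"name": ..., "source": {"source": "github", "repo": ..., "ref"?: ..., "sha"?: ...}}
import json

# Source types the talk says accept `ref` / `sha`.
PINNABLE = {"github", "url", "git-subdir"}


def unpinned_plugins(marketplace: dict) -> list[str]:
    """Names of git-backed plugins with no `sha` (a `ref` alone can drift)."""
    unpinned = []
    for plugin in marketplace.get("plugins", []):
        source = plugin.get("source")
        if not isinstance(source, dict):
            continue  # assumption: string sources are local paths, nothing to pin
        if source.get("source") in PINNABLE and not source.get("sha"):
            unpinned.append(plugin.get("name", "<unnamed>"))
    return unpinned


# Placeholder marketplace: one entry tracks a branch, one is frozen.
example = json.loads("""
{
  "name": "team-marketplace",
  "plugins": [
    {"name": "claude-mermaid",
     "source": {"source": "github", "repo": "example-org/claude-mermaid", "ref": "main"}},
    {"name": "revealjs-skill",
     "source": {"source": "github", "repo": "example-org/revealjs-skill",
                "sha": "<40-char-commit-sha>"}}
  ]
}
""")

print(unpinned_plugins(example))  # ['claude-mermaid']
```

In CI, the same function would read the real file and fail the job on a non-empty result. This turns "pin to a commit" from a convention into a check.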
**Step 6 · Famous frameworks**

| Tool | Obs | Cost | Simp | Corr | Use case |
|---|---|---|---|---|---|
| obra/superpowers | A | D | D | A+ | TDD methodology · pay Cost & Simp to buy A+ Corr |
| pr-review-toolkit | A | C | C | A+ | pre-merge review · same trade as superpowers, lighter |
| wshobson/agents | A | C | C | A | cherry-pick 2–3 · don't install the whole marketplace |
| claude-flow | D | F | D | D | personal/hobby demo only · even Corr is D — nothing to buy · drop above personal/hobby |

Cost & Simp can be *paid* when Correctness is the bottleneck — but failing the gate you're buying is still a no. No averaging.

**Step 7 · Around Claude Code**

| Tool | Obs | Cost | Simp | Corr | Use case |
|---|---|---|---|---|---|
| ccusage | A+ | A+ | A+ | A+ | local token/cost analyzer · the no-brainer · default-on |
| claudia | A | A+ | A | A | internal product+ · desktop dashboard for teams that want a UI alongside the CLI |
| claude-code-router | D | A+ | C | D | personal/hobby only · routes to DeepSeek/Gemini · two failing gates · never for internal product+ (client data) |

ccusage makes Cost enforceable; claudia adds a lens; the router is the cautionary tale — same scores, but the recommendation flips from Reject to Buy on a personal hobby project.

### Tactics ▸ how to raise scores

Tactics read as **deltas**: `++` raises a grade · `+` helps · `=` unchanged · `−` small cost. Match the move to the failing gate.

| Tactic | Observability | Cost | Simplicity | Correctness |
|---|---|---|---|---|
| Vendoring (pull the code in) | ++ | + | − | = |
| Version locking (pin models, prompts, data) | + | = | = | ++ |
| Audit hooks (cheap-model checks) | ++ | − | = | + |

(Compose your dev experience from many small tools — after Wardley & Girba, *Rewilding SE*.)
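"Match the move to the failing gate" amounts to a lookup over the tactics table. This Python sketch encodes the table as data and returns the tactic or tactics with the largest positive delta on a given gate. It makes two assumptions:

- The numeric weights (`++` = 2, `+` = 1, `=` = 0, `−` = −1) are for illustration only and are not part of the rubric.
- ASCII `-` stands in for the table's minus sign.

`best_tactics` is an illustrative name.

```python
# Sketch: pick the tactic(s) that best raise a failing gate.
# Assumption: symbolic deltas map to illustrative integer weights.
DELTA = {"++": 2, "+": 1, "=": 0, "-": -1}

# The tactics table, row by row (ASCII "-" in place of the table's minus sign).
TACTICS = {
    "Vendoring": {"obs": "++", "cost": "+", "simp": "-", "corr": "="},
    "Version locking": {"obs": "+", "cost": "=", "simp": "=", "corr": "++"},
    "Audit hooks": {"obs": "++", "cost": "-", "simp": "=", "corr": "+"},
}


def best_tactics(failing_gate: str) -> list[str]:
    """Tactics with the largest positive delta on the failing gate."""
    gains = {name: DELTA[row[failing_gate]] for name, row in TACTICS.items()}
    top = max(gains.values())
    if top <= 0:
        return []  # nothing in the table raises this gate
    return sorted(name for name, gain in gains.items() if gain == top)


for gate in ("obs", "cost", "simp", "corr"):
    print(gate, "->", best_tactics(gate))
```

The empty result for Simplicity reflects the table itself: none of the three tactics raises that gate, and vendoring lowers it.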
### Take the rubric home

`/claude-tool-audit` — a Claude Code plugin that walks you through scoring a candidate tool against the four gates:

- `audit-tool` — score one candidate
- `audit-project` — audit a whole repo
- `budget-planner` — set per-gate budgets for a new project

29+ worked audits covering models, MCPs, hooks, frameworks, and wrappers — all parseable, all comparable.

---

## Finale — The Reveal

While Rounds 2–4 are presented, a separate Claude Code session regenerates the opening Simon Sees game from a spec. The talk ends by switching to it.

### Would your gates change?

Re-score the same toolkit against a different brief — e.g. "EMP-774 · task compliance monitor." Same person, same HUD aesthetic; the *use case* shifted. Pick your gates, score again, drop the noise.

### Same gates. Different setup wins.

Two example profiles from the audit framework — different budgets, sometimes different gates. Pick yours.

| Budget | The Game (personal/hobby, 5 min, a laugh) | Surveillance (regulated, 24/7, livelihoods) |
|---|---|---|
| Observability | ≥ C · ok | ≥ A+ · required |
| Cost | ≥ D · 5 min/year | ≥ A · 24/7 runtime |
| Simplicity | ≥ A · wins | ≥ D · layered OK |
| Correctness | ≥ C · false positives are funny | ≥ A+ · false positives cost jobs |
| + Ethics (5th gate) | — n/a | ≥ A+ · added |

The toolkit is portable. The judgment isn't.

### Thank you

Thank you — questions ▸ rebuild ▸ play. The bookend is the deliverable: opening game → frameworks justify the spec → spec rebuilds the game.

Speaker: **Martin Brian** — Senior AI Engineer.

---

## Sources & attributions

- Cynefin — Snowden, D. J., & Boone, M. E., *A Leader's Framework for Decision Making*, HBR, Nov 2007. https://hbr.org/2007/11/a-leaders-framework-for-decision-making
- "AI is an amplifier" — Google DORA, *State of AI-assisted Software Development*; framing by Nathen Harvey.
  https://cloud.google.com/resources/content/dora-roi-of-ai-assisted-software-development
- Pre-mortem — Klein, G., *Performing a Project Premortem*, HBR, Sep 2007. https://hbr.org/2007/09/performing-a-project-premortem (clustering method: Mountain Goat Software / Mike Cohn).
- Coherence-not-truth & "build the tool that retrieves the fact" — Wardley, S. & Girba, T., *Rewilding Software Engineering*. https://medium.com/feenk/rewilding-software-engineering-900ca95ebc8c — and Wąsowski, J., "Stop writing specs, start writing facts." (paraphrased, not quoted)
- Knight Capital — ~$440M in 45 minutes, Aug 1, 2012. https://en.wikipedia.org/wiki/Knight_Capital_Group
- One-way / two-way doors — Bezos, 2016 Amazon shareholder letter. https://www.aboutamazon.com/news/company-news/2016-letter-to-shareholders
- Claude Code primitives — https://code.claude.com/docs (permissions, hooks, mcp, sub-agents, memory, skills, plugins).
- Cross-cutting frameworks — Choose Boring Technology (mcfunley.com/choose-boring-technology), Wardley Maps (learnwardleymapping.com), Google SRE error budgets (sre.google/sre-book/embracing-risk).

Note: the Wardley/Girba one-liners are paraphrases from the *Rewilding Software Engineering* series, not verbatim quotes. Tool star-counts and grades are circumstantial and were last verified 2026-05-28.