Building Software Faster with LLMs: Part 4 - The Way We Build Software Is Rapidly Evolving

Note: This article series was written by me, with LLMs helping to refine the style and structure.

Part 1 identified the problems. Part 2 covered ergonomics. Part 3 showed coordination patterns.

This article covers tools that became obsolete, workflows that didn’t pan out, and lessons learned from building at the edge of what works.

Series Navigation: ← Part 3: Abstractions | Part 5: Learning →

The Project Ingester: Solving Yesterday’s Problem

Before Sonnet 4.5, exploring a codebase was slow. Reading 20 files meant 20 sequential API calls and 10+ minutes of setup time.

I built project-ingest to solve this: it outputs a single markdown document with the project structure, key file contents, and a dependency graph. Claude could ingest that in one shot instead of reading files incrementally.
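
Conceptually, that is just a repository walk flattened into one document. Below is a minimal hypothetical sketch of the idea, not the real tool: the file-type filter and size cap are assumptions, and the dependency-graph part is omitted.

#!/usr/bin/env python3
"""Minimal sketch of a project-ingest-style tool: walk a repository and emit
one markdown document with the file tree and the contents of key files,
so an LLM can ingest the whole project in a single read."""
from pathlib import Path
import sys

INCLUDE = {".py", ".md", ".toml"}   # assumption: which file types count as "key files"
MAX_CHARS = 20_000                  # truncate very large files

def ingest(root: Path) -> str:
    files = sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix in INCLUDE and ".git" not in p.parts
    )
    doc = [f"# Project snapshot: {root.name}", "", "## Structure"]
    doc += [f"- {p.relative_to(root)}" for p in files]
    doc.append("")
    for p in files:
        text = p.read_text(errors="replace")
        if len(text) > MAX_CHARS:
            text = text[:MAX_CHARS] + "\n... (truncated)"
        doc += [f"## {p.relative_to(root)}", "", text.rstrip(), ""]
    return "\n".join(doc)

if __name__ == "__main__":
    print(ingest(Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")))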

Before (Sonnet 3.5, using project-ingest):

  • Run project-ingest → 15 seconds
  • Claude reads summary → 5 seconds
  • Total: 20 seconds

After (Sonnet 4.5, reading files directly):

  • Claude reads 20 files directly → 8 seconds
  • Total: 8 seconds

The tool became slower than the problem it solved.

When It’s Still Useful

I haven’t deleted it. It’s valuable for:

  1. Very large codebases (100+ files) - still faster for high-level view
  2. Project snapshots - capturing state at a point in time
  3. Documentation - overview for human readers
  4. Cross-project analysis - comparing architecture

But for everyday “help me understand this project” tasks? Obsolete.

The Lesson

Build for today’s constraints. The tool was perfect for its time. Model improvements made it unnecessary. That’s success, not failure.

Code Review Logger: Decoupling Discovery from Fixing

When LLMs generate code at scale, the output arrives too fast to carefully review every change in real time.

The problem: I’m reading through hundreds of lines of Claude-generated code. I spot issues—unclear function names, generic error handling, repeated patterns. But I’m in discovery mode, trying to understand the whole picture. Stopping to craft detailed prompts for each fix kills momentum.

What I need: a fast way to point at a problem, leave a hint that steers the fix in the right direction, and then batch the corrections for Claude to work through later.

The workflow:

  1. Mark issues at exact lines while browsing
  2. Keep reading without losing flow
  3. Later, batch all issues together and have Claude fix them

The Emacs Integration

I built code-review-logger.el:

;; While reviewing code in Emacs:
;; SPC r c - Log comment at current line
;; SPC r r - Log comment for selected region
;; SPC r o - Open review log

(defun code-review-log-comment (comment)
  "Log review COMMENT for the current file and line."
  (interactive "sReview comment: ")   ; prompt for the comment when invoked via SPC r c
  (let ((file (buffer-file-name))
        (line (line-number-at-pos)))
    ;; Helper (defined elsewhere in the package) appends an org TODO entry
    ;; with the file, line, and project metadata to ~/code_review.org.
    (code-review-format-entry comment file line "TODO")))

This creates entries in ~/code_review.org:

** TODO [[file:~/repos/memento/src/cli.py::127][cli.py:127]]
   :PROPERTIES:
   :PROJECT: memento
   :TIMESTAMP: [2025-09-30 Mon 14:23]
   :END:
   This error handling is too generic - catch specific exceptions

** TODO [[file:~/repos/memento/src/search.py::89][search.py:89]]
   :PROPERTIES:
   :PROJECT: memento
   :TIMESTAMP: [2025-09-30 Mon 14:25]
   :END:
   Add caching here - search is called repeatedly with same query

The Workflow

  1. Review code in Emacs (syntax highlighting, jump-to-def, all IDE features)
  2. Mark issues as I find them (SPC r c for quick comment)
  3. Trigger the automated fix process: Read code-review-llm-prompt-template and follow it
  4. Claude automatically:
    • Reads ~/code_review.org for all TODO items
    • Fixes each issue in the actual code
    • Runs make test after every change
    • Marks items as DONE only when tests pass
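
To make the batch step concrete, here is a hypothetical sketch of how the TODO entries in ~/code_review.org could be collected into file/line/comment tasks. The parsing code is only an illustration (the actual workflow is driven by a prompt template, described next); it assumes the exact entry format shown above.

"""Hypothetical sketch: collect open TODO review entries from ~/code_review.org."""
import re
from pathlib import Path

# Matches headings like:
# ** TODO [[file:~/repos/memento/src/cli.py::127][cli.py:127]]
HEADING = re.compile(r"^\*\* TODO \[\[file:(?P<path>[^:\]]+)::(?P<line>\d+)\]")

def collect_todos(org_file: Path = Path.home() / "code_review.org"):
    """Return a list of {path, line, comment} dicts, one per open TODO entry."""
    todos, current = [], None
    for raw in org_file.read_text().splitlines():
        match = HEADING.match(raw)
        if match:
            current = {"path": match["path"], "line": int(match["line"]), "comment": []}
            todos.append(current)
        elif raw.startswith("** "):            # any other heading (e.g. DONE) closes the entry
            current = None
        elif current and raw.strip() and not raw.strip().startswith(":"):
            current["comment"].append(raw.strip())   # skip the :PROPERTIES: drawer lines
    return todos

if __name__ == "__main__":
    for t in collect_todos():
        print(f"{t['path']}:{t['line']}  ->  {' '.join(t['comment'])}")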

The entire workflow is encoded in a memento note that Claude reads. It contains:

  • Review format specification
  • Priority order (correctness → architecture → security → performance)
  • Testing requirements (always run make test, never leave tests failing)
  • Complete fix-and-verify process

Why This Works

Batch processing is more efficient than interactive fixes:

  • Claude sees all issues at once, plans holistically
  • No back-and-forth during fixing
  • Tests run after every change
  • Clear audit trail

Emacs integration solves the “reviewing outside my editor” problem:

  • In my editor with all tools
  • Jump to definitions, search references, check blame
  • Clickable org links to code

Structured format means precise instructions:

  • Exact file paths and line numbers
  • Context about the issue
  • Project name for multi-repo workflows

Current State

The fix side is fully automated. I just say: Read code-review-llm-prompt-template and follow it

Claude then processes every TODO item, fixes the issues, runs the tests, and marks items DONE, never leaving the codebase with failing tests.
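
To make the “never leave failing tests” rule concrete, here is a small illustrative sketch of the verification gate. It is not code that exists in the workflow; the real fixes are applied by Claude following the prompt template, and apply_fix stands in for whatever edit is made.

"""Sketch of the fix-and-verify gate: a fix only counts as DONE if make test passes."""
import subprocess
from pathlib import Path

def resolve_item(todo: dict, repo: Path, apply_fix) -> str:
    """Apply a fix, then gate the status change on the full test suite."""
    apply_fix(todo, repo)                                # the edit itself is made by Claude
    result = subprocess.run(["make", "test"], cwd=repo)  # always run make test after a change
    return "DONE" if result.returncode == 0 else "TODO"  # only mark DONE when tests pass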

Key Learnings

Embrace obsolescence. If a tool becomes unnecessary because the problem disappeared, that’s progress.

Perfect is the enemy of done. The code review logger works even though the end-to-end process isn’t fully automated. Ship it.

Build for constraints, not aspirations. Don’t future-proof. Solve today’s problem with today’s constraints.

Fast feedback beats comprehensive coverage. Quick hints during review, batch fixes later. Speed of iteration matters more than perfection.

The Future: Moving Beyond Supervision

The pattern of tools becoming obsolete points to something bigger. Right now, I’m building tools to supervise one or more LLMs working together. But this is a transitional phase.

We’re moving toward a different form of collaboration—one where we identify the tasks that absolutely have to come from a human, and delegate everything else.

What must remain human:

  • Steering direction: What problem are we actually solving?
  • Final decisions on UX: How should this feel to users?
  • Architectural trade-offs: What complexity is worth accepting?
  • Quality standards: What level of polish matters for this?

What LLMs can increasingly handle:

  • Implementation details
  • Test coverage
  • Documentation
  • Refactoring
  • Performance optimization
  • Debugging

The project ingester became obsolete because models got faster at reading files. The code review logger works because it focuses my human effort on spotting issues, not fixing them. The pattern: human judgment for direction, LLM execution for implementation.

As models improve, the boundary shifts. Tasks that require human oversight today become fully automated tomorrow. The tools we build now are scaffolding—useful for this moment, likely obsolete soon.

The Societal Challenge

But we can’t ignore the larger implications. This shift isn’t just about productivity—it’s about what happens to the people whose expertise becomes less essential.

What does it mean when:

  • Junior developers find fewer entry-level opportunities because LLMs handle beginner tasks?
  • Mid-level engineers see their core skills automated away faster than they can adapt?
  • The gap between “steering direction” and “writing code” leaves fewer rungs on the career ladder?

The technical solutions—better tools, better workflows—are the easy part. The hard questions are political and societal:

How do we ensure the gains from AI-augmented productivity are distributed fairly? If a small group of people supervising LLMs can do the work that previously required large teams, who benefits from that efficiency? The workers who are displaced, or the companies and shareholders who capture the value?

What safety nets and retraining programs do we need? The pace of change is faster than traditional education and career transitions can handle. Moving people from “implementer” to “director” roles requires more than technical training—it requires fundamentally different skills and mindsets.

How do we preserve the learning paths that create expertise? If LLMs handle all junior-level work, how do people develop the judgment needed for senior roles? Expertise comes from doing, debugging, and making mistakes. When those learning opportunities disappear, where do the next generation of experts come from?

I don’t have answers to these questions. I’m optimizing my own workflow, building tools that make me more productive. But I recognize that scaling these patterns across the industry has consequences beyond individual efficiency gains.

For a deeper exploration of where this trajectory leads, I recommend reading The Race to AI Supremacy—it examines the broader societal and political implications of rapidly advancing AI capabilities.

The tools becoming obsolete is progress. But progress for whom, and at what cost? Those are questions we need to grapple with collectively, not just as individuals optimizing our workflows.

What’s Next

Part 5 covers using Claude as a learning tool: generating flashcards, creating annotated worksheets, and building a spaced-repetition system for technical concepts.


Continue Reading: Part 5: Learning and Knowledge Accumulation →


If you’re interested in any of the tools or patterns mentioned in this series, feel free to reach out. I’m happy to discuss what you find compelling and share more details.