Pilum: From Launch to Production-Ready in 3 Months

In December 2025, I open-sourced Pilum — a multi-cloud deployment CLI that deploys to Cloud Run, Lambda, Azure, Cloudflare Pages, npm, Homebrew, and Docker Hub from a single pilum.yaml. The announcement post covered the architecture: recipes, ingredients, handlers, wave-based execution.

That was the “it compiles and the tests pass” version.

Three months and 40+ pull requests later, Pilum deploys all of SID Technologies — platform-core, Torch, Statio, every website, every npm package, and itself. This post is about everything that broke between “it works” and “it ships production software.”


The Timeline

  • Dec 2, 2025: First commit. Baseline CLI with recipe system. (#1)
  • Dec 4: Homebrew release workflow. Pilum dogfoods its own deployment. (#2-#4)
  • Dec 31: Service graph, --only-changed, file embedding support. The “I need this for real” features. (#22-#24)
  • Jan 3-12: Documentation and bug fixes. The quiet “oh, this doesn’t actually work” phase. (#25-#29)
  • Feb 6-7: The big feature sprint — 8 features in 48 hours. Wave deployments, npm recipe, Cloudflare Pages, Azure Container Apps, Cloud Run Jobs, environment variables, JSON output, history command. (#30-#38)
  • Feb 8-12: The big fix sprint — YAML parsing, package manager issues, build failures, error swallowing, GCP secrets, Cloudflare execution. Everything from the feature sprint broke something. (#42-#49)
  • Feb 24: Security hardening and npm publishing fixes. (#51-#53)
  • Mar-Apr: Memory-based worker allocation, wave ordering bug, orchestrator rewrite. The “I thought this was done” phase. (#58-#62)

The pattern is clear: features ship fast, fixes ship faster, and the real bugs show up a month later.


Wave-Based Deployments Were Broken

Wave-based deployment was the headline feature in #31. Services declare dependencies, Pilum builds a dependency graph, topologically sorts it into waves, and executes waves in parallel.

It worked perfectly in tests. Then I tried to publish npm packages.

The problem: Pilum publishes @sid-technologies/base-ui, which depends on @repo/configs. Both are in the same monorepo. Wave ordering correctly put configs before base-ui. But the worker allocation was distributing work across goroutines without respecting wave boundaries — workers from wave 2 could start before wave 1 fully drained.

The npm registry would reject base-ui because configs hadn’t finished publishing yet. Sometimes it worked (race condition timing), sometimes it didn’t.

The fix (#61) was two things: memory-based worker allocation (don’t spawn more workers than the machine can handle) and a proper wave barrier — wave N+1 doesn’t start until every worker in wave N has completed and reported success. The entire orchestrator got rewritten in #60 to make this possible — runner.go (772 lines) was replaced with execution.go, pipeline.go, steps.go, and waves.go (total: ~773 lines, but now testable and correct).


Error Swallowing

This was the scariest bug (#49). The worker queue was silently swallowing errors from shell commands. A deployment step would fail, the worker would log it, but the orchestrator wouldn’t see the failure. The deploy would “succeed” with broken services.

The root cause: the command worker’s error channel wasn’t being read in all code paths. If a command failed during the capture phase (reading stdout/stderr), the error was logged but not propagated to the result channel.

The fix was straightforward — propagate errors through every code path — but finding it required adding 274 lines of tests to the worker queue and orchestrator. The bug was invisible because the services usually deployed fine. It only surfaced when a Docker build failed mid-stream and Pilum reported success anyway.
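In miniature, the invariant that was violated: every step must send exactly one result, error included, on every code path. A sketch with simplified types, not Pilum's actual worker queue:

```go
package main

import (
	"fmt"
	"os/exec"
)

// result carries the outcome of one shell step. The bug class here is
// an error path that logs the failure but never writes it to the
// channel, so the orchestrator never sees it.
type result struct {
	output string
	err    error
}

// runStep always sends exactly one result, whether the command fails
// to start, fails mid-capture, or succeeds.
func runStep(cmdline string, results chan<- result) {
	out, err := exec.Command("sh", "-c", cmdline).CombinedOutput()
	// propagate err on every path instead of only logging it
	results <- result{output: string(out), err: err}
}

func main() {
	results := make(chan result, 1)
	go runStep("exit 3", results)
	r := <-results
	fmt.Println("failed:", r.err != nil) // prints "failed: true"
}
```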

Lesson: if you’re building a deployment tool, test the failure paths harder than the success paths. Nobody notices when deploys work. They care enormously when a deploy claims success but didn’t actually work.


GCP Secrets Formatting

GCP Cloud Run expects secrets in a specific format: SECRET_NAME=projects/PROJECT/secrets/NAME/versions/latest. Pilum was passing them as SECRET_NAME=value, which works for environment variables but not for secret references.

Two fixes (#48, #58):

  1. A config parser that detects whether a value is a secret reference or a literal, and formats accordingly.
  2. Environment variable handling that properly separates --set-env-vars from --set-secrets in the gcloud run deploy command.

This is the kind of bug that only shows up when you deploy a real service with real secrets, not when you test against mocked GCP APIs.
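A sketch of the partitioning logic, using an illustrative prefix heuristic rather than Pilum's exact detection rule:

```go
package main

import (
	"fmt"
	"strings"
)

// splitEnvAndSecrets partitions KEY=VALUE pairs into literal env vars
// and Secret Manager references, which gcloud run deploy takes via
// different flags (--set-env-vars vs --set-secrets). The
// "projects/.../secrets/" prefix check is a simplification for
// illustration.
func splitEnvAndSecrets(pairs map[string]string) (envs, secrets []string) {
	for k, v := range pairs {
		if strings.HasPrefix(v, "projects/") && strings.Contains(v, "/secrets/") {
			secrets = append(secrets, fmt.Sprintf("%s=%s", k, v))
		} else {
			envs = append(envs, fmt.Sprintf("%s=%s", k, v))
		}
	}
	return envs, secrets
}

func main() {
	envs, secrets := splitEnvAndSecrets(map[string]string{
		"LOG_LEVEL": "debug",
		"DB_PASS":   "projects/my-proj/secrets/db-pass/versions/latest",
	})
	fmt.Println(len(envs), len(secrets)) // prints "1 1"
}
```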


Cloudflare Pages: Two Fixes, Same Target

Cloudflare Pages broke twice (#47, #56). The first fix was the recipe itself — the wrangler CLI changed its flags between versions and the recipe had hardcoded the old format.

The second fix was more subtle: the capture worker was incorrectly detecting the end of output from the wrangler process. It would sometimes truncate the deployment URL from the output, so Pilum couldn’t report where the site was deployed. The fix was a one-line change in how the capture worker detects stream completion, but it took an hour to diagnose because wrangler’s output format is inconsistent between deploy targets.


YAML Parsing

The YAML parser (#42) had a type-coercion issue. YAML 1.1 treats bare scalars like on, yes, and no as booleans (the classic “Norway problem”) and unquoted numbers like 1.20 as floats, so values that users meant as strings arrived as the wrong type and broke the build step, which expected a string.

Two-line fix: force string type on specific fields during parsing. But it’s the kind of bug that makes you question every YAML-based config format you’ve ever designed.
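The guard amounts to recognizing which bare scalars a YAML 1.1 parser would coerce, then forcing those fields back to strings. A sketch of the recognition half (illustrative, not the actual parser fix):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// looksNonString reports whether a bare YAML scalar would be coerced
// to a non-string type by a YAML 1.1 parser: boolean-like words,
// null-ish values, and numbers. Fields that must be strings can be
// forced to string type when this returns true.
func looksNonString(s string) bool {
	switch strings.ToLower(s) {
	case "y", "yes", "n", "no", "true", "false", "on", "off", "null", "~", "":
		return true
	}
	if _, err := strconv.ParseFloat(s, 64); err == nil {
		return true
	}
	return false
}

func main() {
	fmt.Println(looksNonString("no"))   // true: the classic Norway problem
	fmt.Println(looksNonString("1.20")) // true: parses as a float
	fmt.Println(looksNonString("go"))   // false: stays a string
}
```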


npm Publishing: A Three-PR Saga

Getting npm packages to publish correctly took three separate PRs (#52, #53, #57):

  1. #52: The npm ingredient needed workspace resolution. In a pnpm monorepo, pnpm publish needs to know which workspace you’re publishing. Added a resolve_workspaces.js script that walks the workspace config and finds the right package.

  2. #53: E2E test infrastructure. After #52 I realized I had no way to test the full publish flow without actually hitting the npm registry. Built a test harness with fixture packages and dry-run publishing.

  3. #57: npm vs pnpm command resolution. Some systems have npm but not pnpm, or vice versa. The completion and local command system needed to detect which package manager is available and use the right one.


Features That Got Added Because We Needed Them

The December launch had the core recipe engine and a few deploy targets. Everything below was added in February because real usage demanded it:

  • Wave-based deployments (#31): Services with dependencies deploy in order. The headline feature that later broke.
  • npm recipe (#32): Publish packages to npm. Required for the monorepo workflow.
  • Cloudflare Pages (#34): Static site deployment. Every SID website uses this.
  • Azure Container Apps (#38): Because “multi-cloud” means more than just GCP.
  • GCP Cloud Run Jobs (#30): For batch processing and migrations, not just long-running services.
  • JSON output (#36): For CI/CD integration. Parse deployment results programmatically.
  • History and status commands (#37, #40): “What did I deploy last? Is it still running?”
  • Environment variable support (#33): Pass variables to recipes at deploy time.
  • Recipes command (#45): List available recipes and their ingredients.
  • GCP Cloud Run connections (#55): VPC connectors, Cloud SQL connections, service-to-service auth.

Ten features in three weeks. Each one spawned 1-2 bug fixes in the following weeks.


The GitHub Action

By February 9, Pilum was stable enough that I didn’t want to keep writing curl | sh install scripts in every CI workflow. So I built pilum-action — a GitHub Action that installs Pilum and runs any command.

  - uses: SID-Technologies/pilum-action@v1
    with:
      command: deploy
      tag: ${{ github.event.release.tag_name }}

It auto-detects the runner’s OS and architecture, downloads the right binary, and passes through all credentials via environment variables. Simple wrapper, but it cut 15-20 lines of boilerplate from every deployment workflow.

The action itself hit a bug immediately: the binary name in the release archive included the version number (pilum-v0.3.1-linux-amd64), but the action was looking for pilum. Two quick fixes (#3, #4) and a third one in March when the naming convention changed again.

Every SID repo now uses pilum-action@v1 for deployments instead of raw shell scripts.


The Website

The pilum.dev website launched the same day as the CLI (December 3) — an Astro site with an animated terminal demo, deploy target icons, and full documentation.

Over the next three months it evolved alongside the tool:

  • Dec 5: Automated tag updates — when Pilum releases a new version, the website automatically updates the version displayed in the terminal demo (#4)
  • Jan 3: Full documentation pages — getting started, CLI reference, recipe guides (#5)
  • Feb 7-8: New provider icons as deploy targets were added to the CLI (#7, #8)
  • Feb 11: Documentation for all new features from the February sprint (#9)
  • Feb 16: Design improvements — the site went from “functional docs” to “looks like a real product” (#10)
  • Mar 19: Proper favicons replacing the placeholder (#12)

The website was always the last thing updated after a feature shipped. In hindsight, having docs deploy automatically from the same pilum deploy pipeline would have kept them in sync. That’s a future improvement.


Security Hardening

PR #51 was a comprehensive security audit response. The headline finding: shell injection via pilum.yaml values. If a service name or environment variable contained shell metacharacters ($(rm -rf /)), they’d be passed unsanitized to sh -c.

The fix introduced a shellutil package with:

  • Quote(): POSIX single-quote escaping for string commands
  • ValidateServiceName(): Reject names with shell metacharacters
  • SanitizeHeredocValue(): Escape $, backticks, and backslashes in heredoc values

Also added: symlink traversal prevention, path validation, and safe temporary directory handling.
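The core of Quote() is the standard POSIX trick: wrap the value in single quotes and rewrite each embedded single quote as '\''. A sketch of the idea, not Pilum's exact shellutil implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// Quote wraps s in POSIX single quotes, escaping embedded single
// quotes as '\'' so the result is always a single literal word to sh:
// no expansion, no command substitution, no globbing.
func Quote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

func main() {
	fmt.Println(Quote("$(rm -rf /)")) // prints '$(rm -rf /)', inert under sh -c
}
```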

Pilum’s attack surface is small — users run their own configs — but in CI/CD environments where configs might be generated or templates might include untrusted input, these protections matter.

Later, we added Grype + Syft vulnerability scanning to the CI pipeline (#51 and subsequent). Every dependency is scanned against the NVD database, and builds fail on high-severity CVEs.


The Numbers

Four months of dogfooding (December 2025 — April 2026):

  • 62 PRs merged (17 features, 21 fixes, rest chore/docs)
  • Deploy targets used in production: Cloud Run, Cloudflare Pages, npm, Homebrew (4 of 7)
  • Repos deploying with Pilum: 6+ (platform-core, Torch, Statio, all websites, Pilum itself)
  • Orchestrator rewritten: once (the 772-line runner.go → modular architecture)
  • Security findings addressed: 11
  • Feb 7, 2026: 8 features shipped in one day. Feb 8-12: 5 bugs found from that sprint.

What I’d Do Differently

  1. E2E tests from day one. Unit tests caught maybe 30% of the bugs on this list. Every serious bug — wave ordering, error swallowing, npm publishing, Cloudflare output — required the full pipeline to reproduce. I built the E2E framework in #53. Should have been #2.

  2. Fewer features per sprint. The Feb 7 feature marathon (8 features in 48 hours) created a week of bug fixes. Shipping 2-3 features with immediate dogfooding would have caught issues faster.

  3. The orchestrator should have been modular from the start. The 772-line runner.go was unmaintainable. The rewrite into execution.go, pipeline.go, steps.go, and waves.go should have been the original architecture. The monolithic runner made every bug harder to find and every fix riskier.


What’s Next

Pilum is stable. It deploys everything SID Technologies builds. The immediate roadmap:

  • More deploy targets as needed (Fly.io, Railway)
  • Recipe marketplace for community contributions
  • Better error messages (the #1 user experience issue)
  • Pilum Cloud (hosted control plane, someday) — but that’s a different product, not an open-source feature

The Takeaway

The gap between “open-source announcement” and “tool I trust with production deploys” was about 40 pull requests. Every real deployment surfaced an edge case that unit tests never caught. The wave ordering bug was invisible for two months because most deploys don’t have inter-service dependencies.

If you’re building developer tools: dogfood them immediately, not eventually. The first month after launch is a second development phase. Budget for it.