Introducing Pilum: A Recipe-Driven Deployment Orchestrator
Today I’m open-sourcing Pilum, a multi-service deployment orchestrator written in Go. This post covers the problem it solves, the architecture decisions, and how to use and extend it.
The Problem
At SID Technologies, we run services across multiple cloud providers and distribution channels. A typical release involves:
- API services deployed to GCP Cloud Run
- CLI tools distributed via Homebrew
- Background workers on Cloud Run with different scaling configs
- Static assets synced to S3
Each platform has its own deployment CLI, authentication model, and configuration format. The cognitive overhead compounds: remembering gcloud run deploy flags versus brew tap semantics versus aws s3 sync options.
Deployments became a copy-paste ritual from shell scripts scattered across our monorepo. Each service had its own deploy.sh, and the scripts inevitably drifted. Authentication got hardcoded. Flags changed between services. When we needed to add a new environment variable to all services, it was 19 manual edits.
I wanted:
- One command to deploy any service to any target
- Declarative configuration instead of imperative scripts
- Parallel execution across services
- Provider-agnostic core with pluggable providers
A Complete Example
Before diving into architecture, let’s see what using Pilum looks like.
Step 1: Create service.yaml in your service directory
name: api-gateway
provider: gcp
project: sid-production
region: us-central1
build:
  language: go
  version: "1.23"
  binary_name: api-gateway
Step 2: Validate your configuration
$ pilum check
✓ Found service: api-gateway
✓ Recipe: gcp-cloud-run
✓ Required fields present: project, region
✓ Build config valid
Step 3: Preview what would happen
$ pilum deploy --tag=v1.2.0 --dry-run
[api-gateway] Step 1/4: build binary
  Command: go build -ldflags "-X main.version=v1.2.0" -o dist/api-gateway .
  Working dir: services/api-gateway
  Timeout: 300s
[api-gateway] Step 2/4: build docker image
  Command: docker build -t gcr.io/sid-production/api-gateway:v1.2.0 .
  Working dir: services/api-gateway
  Timeout: 300s
[api-gateway] Step 3/4: publish to registry
  Command: docker push gcr.io/sid-production/api-gateway:v1.2.0
  Working dir: /
  Timeout: 120s
[api-gateway] Step 4/4: deploy to cloud run
  Command: gcloud run deploy api-gateway \
    --image=gcr.io/sid-production/api-gateway:v1.2.0 \
    --region=us-central1 \
    --project=sid-production
  Working dir: /
  Timeout: 180s
  Retries: 2
Dry run complete. No commands executed.
Step 4: Deploy
$ pilum deploy --tag=v1.2.0
[api-gateway] ⏳ build binary
[api-gateway] ✓ build binary (2.3s)
[api-gateway] ⏳ build docker image
[api-gateway] ✓ build docker image (45.2s)
[api-gateway] ⏳ publish to registry
[api-gateway] ✓ publish to registry (12.1s)
[api-gateway] ⏳ deploy to cloud run
[api-gateway] ✓ deploy to cloud run (18.4s)
Deployment complete: 1 service deployed in 78.0s
That’s it. One command. One service deployed.
With multiple services:
$ pilum deploy --tag=v1.2.0 --services=api-gateway,auth-service,billing-service
Step 1/4: build binary
[api-gateway] ✓ (2.1s)
[auth-service] ✓ (1.9s)
[billing-service] ✓ (2.4s)
Step 2/4: build docker image
[api-gateway] ✓ (43.2s)
[auth-service] ✓ (41.8s)
[billing-service] ✓ (44.1s)
Step 3/4: publish to registry
[api-gateway] ✓ (11.2s)
[auth-service] ✓ (10.9s)
[billing-service] ✓ (11.8s)
Step 4/4: deploy to cloud run
[api-gateway] ✓ (17.2s)
[auth-service] ✓ (18.1s)
[billing-service] ✓ (17.8s)
Deployment complete: 3 services deployed in 82.3s
All services build in parallel. All images push in parallel. All deploys happen in parallel. But each step completes before the next begins.
Inspiration: The Roman Pilum
The pilum was the javelin of the Roman legions. Its design was elegant in its specificity: a long iron shank connected to a wooden shaft, with a weighted pyramidal tip. It was engineered for a single purpose—to be thrown once and penetrate the target.
The soft iron shank would bend on impact, preventing the enemy from throwing it back and rendering their shield useless. One weapon. One throw. Mission accomplished.
This resonated with what I wanted from a deployment tool: define the target once, execute once, hit precisely.
How Pilum Relates to Other Tools
Terraform and Pulumi
Terraform/Pulumi excel at provisioning infrastructure: creating VPCs, databases, load balancers. They’re declarative about what resources should exist. But they’re not optimized for the deployment workflow: building code, pushing images, rolling out new versions.
You can make Terraform deploy a Cloud Run service, but you’re fighting the tool’s abstractions. Terraform wants to manage resource state. When you “deploy” via Terraform, you’re really updating a resource’s image tag. The build and push happen outside Terraform in custom scripts or CI.
They’re complementary: Use Terraform to provision your Cloud Run service (define the resource, set IAM policies, configure scaling). Use Pilum to deploy new versions to it (build, push, update).
Tilt and Skaffold
Tilt is excellent for development workflows—live reloading, local Kubernetes clusters, fast iteration. But it’s development-focused. Pilum is production-focused: tag-based deployments, multi-provider support, CI/CD integration.
If Tilt is your hot-reload development server, Pilum is your production deployment pipeline.
Skaffold is Google’s deployment tool for Kubernetes. If you’re all-in on K8s, it’s great. But we deploy to:
- Cloud Run (managed containers, not K8s)
- Homebrew (binaries, not containers)
- S3 (static assets, not workloads)
Skaffold doesn’t model these workflows. Pilum does.
ko and Earthly
ko is fantastic for deploying Go containers to Kubernetes. Simple, fast, purpose-built. But it’s single-language (Go only) and single-platform (Kubernetes only). We needed multi-language (Go + TypeScript) and multi-platform (Cloud Run + Homebrew + S3).
If you’re deploying one Go service to K8s, use ko. It’s simpler.
Earthly is a build tool that can handle deployment. But Earthfiles are complex—you’re writing imperative scripts in a DSL. And it doesn’t have Pilum’s recipe system. Every service needs its own Earthfile with duplicated logic.
Pilum’s niche: Multi-service, multi-provider deployments with declarative recipes and parallel execution. If you’re deploying one service to one platform, use the platform’s native tool. If you’re orchestrating 20 services across 3 platforms, Pilum might fit.
Shell Scripts
Shell scripts work until they don’t. They’re imperative, hard to test, and the “deployment logic” gets scattered across Makefiles, CI configs, and random bash files. Adding a new provider means duplicating logic across all services.
Pilum inverts this: the deployment logic (the recipe) is centralized and reusable. Services declare what they need (service.yaml), recipes declare how to deploy (recipe.yaml), and the orchestrator coordinates execution.
Architecture
Pilum’s architecture separates three concerns:
┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
│ Service Config  │      │     Recipe      │      │    Handlers     │
│ (service.yaml)  │─────▶│  (recipe.yaml)  │─────▶│ (Go functions)  │
└─────────────────┘      └─────────────────┘      └─────────────────┘
       WHAT                     HOW                  IMPLEMENTATION
Service configs declare what you’re deploying: name, provider, region, build settings. These live in your repo alongside your code.
Recipes define how to deploy to a provider: the ordered sequence of steps, required fields, timeouts. These are YAML files that ship with Pilum.
Handlers implement the actual commands: building Docker images, pushing to registries, calling cloud CLIs. These are Go functions registered at startup.
Service Configuration
A minimal service.yaml:
name: api-gateway
provider: gcp
project: my-project
region: us-central1
build:
  language: go
  version: "1.23"
The provider field determines which recipe is used. All other fields are validated against that recipe’s requirements.
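Parsed into Go, that config could map onto a small struct like the following sketch. The exact field names, the inline Raw map for recipe-specific keys, and the Get helper (used by handlers later in this post) are assumptions inferred from the examples, not Pilum's actual types.
// Sketch only: field set and yaml tags are assumptions inferred from the examples.
type ServiceInfo struct {
    Name     string            `yaml:"name"`
    Provider string            `yaml:"provider"`
    Project  string            `yaml:"project"`
    Region   string            `yaml:"region"`
    Build    BuildConfig       `yaml:"build"`
    Raw      map[string]string `yaml:",inline"` // catches recipe-specific keys, e.g. app_name
}

type BuildConfig struct {
    Language   string `yaml:"language"`
    Version    string `yaml:"version"`
    BinaryName string `yaml:"binary_name"`
}

// Get exposes custom keys to handlers (e.g. ctx.Service.Get("app_name") further below).
func (s ServiceInfo) Get(key string) string { return s.Raw[key] }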
Recipe System
Recipes are the core abstraction. Here’s the GCP Cloud Run recipe:
name: gcp-cloud-run
description: Deploy to Google Cloud Run
provider: gcp
service: cloud_run
required_fields:
  - name: project
    description: GCP project ID
    type: string
  - name: region
    description: GCP region to deploy to
    type: string
steps:
  - name: build binary
    execution_mode: service_dir
    timeout: 300
  - name: build docker image
    execution_mode: service_dir
    timeout: 300
  - name: publish to registry
    execution_mode: root
    timeout: 120
  - name: deploy to cloud run
    execution_mode: root
    timeout: 180
    default_retries: 2
The recipe declares:
- Required fields: Validated before any execution. If your service.yaml is missing project, you get an error immediately, not 10 minutes into a build.
- Steps: Ordered sequence of operations. Each step has a name, execution mode, and timeout.
- Execution mode: service_dir runs in the service’s directory, root runs from the project root.
- Retries: Some steps (like deploy) can retry on failure.
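In Go terms, a recipe decodes into a handful of small structs. The sketch below simply mirrors the YAML above; the type names and tags are assumptions, not Pilum's actual definitions.
type Recipe struct {
    Name           string          `yaml:"name"`
    Description    string          `yaml:"description"`
    Provider       string          `yaml:"provider"`
    Service        string          `yaml:"service"`
    RequiredFields []RequiredField `yaml:"required_fields"`
    Steps          []Step          `yaml:"steps"`
}

type RequiredField struct {
    Name        string `yaml:"name"`
    Description string `yaml:"description"`
    Type        string `yaml:"type"`
    Default     string `yaml:"default"` // optional, e.g. "sea" in the Fly.io recipe below
}

type Step struct {
    Name           string `yaml:"name"`
    ExecutionMode  string `yaml:"execution_mode"` // "service_dir" or "root"
    Timeout        int    `yaml:"timeout"`        // seconds
    DefaultRetries int    `yaml:"default_retries"`
}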
Recipe Validation
The validation logic uses reflection to check service configs against recipe requirements:
func (r *Recipe) ValidateService(svc *serviceinfo.ServiceInfo) error {
    for _, field := range r.RequiredFields {
        value := getServiceField(svc, field.Name)
        if value == "" && field.Default == "" {
            return fmt.Errorf("recipe '%s' requires field '%s': %s",
                r.Name, field.Name, field.Description)
        }
    }
    return nil
}
The getServiceField function first checks a hardcoded map of common fields, then falls back to the raw config map, and finally uses reflection as a last resort. This gives us type safety for known fields while remaining flexible for custom recipe requirements.
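A sketch of that lookup order follows. Only the three tiers come from the description above; the contents of the known-fields map, the Raw fallback, and the reflection details are assumptions.
func getServiceField(svc *serviceinfo.ServiceInfo, name string) string {
    // 1. Hardcoded map of common, strongly typed fields.
    known := map[string]string{
        "name":     svc.Name,
        "provider": svc.Provider,
        "project":  svc.Project,
        "region":   svc.Region,
    }
    if v, ok := known[name]; ok {
        return v
    }
    // 2. Fall back to the raw config map for recipe-specific keys (e.g. app_name).
    if v, ok := svc.Raw[name]; ok {
        return v
    }
    // 3. Last resort: reflect over the struct for any remaining string field.
    f := reflect.ValueOf(*svc).FieldByNameFunc(func(n string) bool {
        return strings.EqualFold(n, name)
    })
    if f.IsValid() && f.Kind() == reflect.String {
        return f.String()
    }
    return ""
}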
Command Registry
Steps map to handlers via a pattern-matching registry:
type StepHandler func(ctx StepContext) any

type CommandRegistry struct {
    handlers map[string]StepHandler
}

func (cr *CommandRegistry) Register(pattern string, provider string, handler StepHandler) {
    key := cr.buildKey(pattern, provider)
    cr.handlers[key] = handler
}

func (cr *CommandRegistry) GetHandler(stepName string, provider string) (StepHandler, bool) {
    // Try provider-specific first, fall back to generic
    // Pattern matching is case-insensitive with partial match support
}
This design allows:
- Generic handlers (build works for any provider)
- Provider-specific overrides (deploy:gcp vs deploy:aws)
- Partial matching (build binary matches the build handler)
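A sketch of what the lookup could look like is below. Only "provider-specific first, then generic, case-insensitive partial match" comes from the comments above; the splitKey helper and the exact precedence rules are assumptions.
func (cr *CommandRegistry) GetHandler(stepName string, provider string) (StepHandler, bool) {
    name := strings.ToLower(stepName)
    // Provider-specific handlers win; an empty provider is treated as the generic bucket.
    for _, prov := range []string{strings.ToLower(provider), ""} {
        for key, handler := range cr.handlers {
            pattern, keyProvider := cr.splitKey(key) // splitKey: hypothetical inverse of buildKey
            if keyProvider != prov {
                continue
            }
            // Partial, case-insensitive match: "build binary" matches the "build" pattern.
            if strings.Contains(name, strings.ToLower(pattern)) {
                return handler, true
            }
        }
    }
    return nil, false
}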
Handlers return any because commands can be strings or string slices:
func buildHandler(ctx StepContext) any {
    return []string{
        "go", "build",
        "-ldflags", fmt.Sprintf("-X main.version=%s", ctx.Tag),
        "-o", fmt.Sprintf("dist/%s", ctx.Service.Name),
        ".",
    }
}
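Handlers get wired in at startup. For example, the generic build handler might be registered like this (a sketch: whether an empty provider string means "generic" is an assumption; DefaultRegistry.Register is the same call the Fly.io example uses later).
func init() {
    // Generic handler: the "build" pattern matches steps like "build binary" for any
    // provider. Registering with an empty provider string is an assumption about how
    // Pilum marks a handler as generic.
    runner.DefaultRegistry.Register("build", "", buildHandler)
}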
Parallel Execution with Step Barriers
The orchestrator executes services in parallel within each step, but steps execute sequentially:
Time →
│
│  Step 1: Build Binary
│  ┌────────────────────────────────────┐
│  │ service-a │ service-b │ service-c  │  ← Parallel
│  └────────────────────────────────────┘
│        ↓           ↓           ↓
│  ════════════════════════════════════  ← Barrier
│        ↓           ↓           ↓
│  Step 2: Build Docker Image
│  ┌────────────────────────────────────┐
│  │ service-a │ service-b │ service-c  │  ← Parallel
│  └────────────────────────────────────┘
│        ↓           ↓           ↓
│  ════════════════════════════════════  ← Barrier
│        ↓           ↓           ↓
│  Step 3: Deploy
│  ┌────────────────────────────────────┐
│  │ service-a │ service-b │ service-c  │  ← Parallel
│  └────────────────────────────────────┘
This ensures dependencies are satisfied: you can’t push an image that hasn’t been built yet.
The implementation:
func (r *Runner) Run() error {
    maxSteps := r.findMaxSteps()
    for stepIdx := 0; stepIdx < maxSteps; stepIdx++ {
        if err := r.executeStep(stepIdx); err != nil {
            return err // Fail fast
        }
    }
    return nil
}
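executeStep isn't shown above; a plausible sketch (r.services, recipeFor, and the skip-when-out-of-steps behavior are assumptions) collects one task per service for the current step index and hands them to the worker pool:
func (r *Runner) executeStep(stepIdx int) error {
    var tasks []stepTask
    for _, svc := range r.services {
        steps := r.recipeFor(svc).Steps // assumed accessor for the service's recipe
        if stepIdx >= len(steps) {
            continue // this service's recipe is shorter; nothing to do at this step
        }
        tasks = append(tasks, stepTask{service: svc, step: steps[stepIdx]})
    }
    return r.executeTasksParallel(tasks)
}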
Within each step, a worker pool processes services concurrently:
func (r *Runner) executeTasksParallel(tasks []stepTask) error {
    var wg sync.WaitGroup
    semaphore := make(chan struct{}, r.getWorkerCount())
    for _, task := range tasks {
        wg.Add(1)
        go func(task stepTask) {
            defer wg.Done()
            semaphore <- struct{}{}        // acquire a worker slot
            defer func() { <-semaphore }() // release it
            result := r.executeTask(task.service, task.step)
            _ = result // ... record the result (collection elided in this excerpt)
        }(task)
    }
    wg.Wait()
    // ... aggregate per-task failures (elided)
    return nil
}
This means:
- All services build in parallel (step 1)
- Once all builds complete, all pushes happen in parallel (step 2)
- Once all pushes complete, all deploys happen in parallel (step 3)
The barrier between steps ensures dependencies are satisfied. You can’t push an image that hasn’t been built.
Variable Substitution
Recipe commands support variable substitution:
func (r *Runner) substituteVars(cmd any, svc serviceinfo.ServiceInfo) any {
    replacer := strings.NewReplacer(
        "${name}", svc.Name,
        "${service.name}", svc.Name,
        "${provider}", svc.Provider,
        "${region}", svc.Region,
        "${project}", svc.Project,
        "${tag}", r.options.Tag,
    )
    // Handle string, []string, and []any
}
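The elided branch is essentially a type switch over the command value; a sketch of that body follows (treating YAML-decoded lists as []any is an assumption):
switch c := cmd.(type) {
case string:
    return replacer.Replace(c)
case []string:
    out := make([]string, len(c))
    for i, s := range c {
        out[i] = replacer.Replace(s)
    }
    return out
case []any:
    out := make([]any, len(c))
    for i, v := range c {
        if s, ok := v.(string); ok {
            out[i] = replacer.Replace(s)
        } else {
            out[i] = v
        }
    }
    return out
default:
    return cmd
}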
This allows recipes to use service-specific values without hardcoding:
steps:
  - name: deploy
    command: gcloud run deploy ${name} --region=${region} --project=${project}
Error Handling and Retries
Steps can specify retry behavior:
steps:
  - name: deploy to cloud run
    timeout: 180
    default_retries: 2
If a deploy fails (network timeout, rate limit, transient cloud error), Pilum retries automatically with exponential backoff:
func (r *Runner) executeWithRetry(task stepTask) error {
    maxRetries := task.step.DefaultRetries
    backoff := 1 * time.Second
    var lastErr error
    for attempt := 0; attempt <= maxRetries; attempt++ {
        lastErr = r.execute(task)
        if lastErr == nil {
            return nil
        }
        if attempt < maxRetries {
            time.Sleep(backoff)
            backoff *= 2
        }
    }
    return fmt.Errorf("failed after %d attempts: %w", maxRetries+1, lastErr)
}
Fail-fast by default: If a step fails for any service (after retries), the entire deployment stops. No partial deployments. This is configurable via --continue-on-error, but we recommend fail-fast for production.
Rollback: Not yet implemented. Currently, rollback is manual (redeploy the previous tag). This is on the roadmap.
Testing and Validation
Recipe validation happens at load time:
func LoadRecipe(path string) (*Recipe, error) {
    data, err := os.ReadFile(path)
    if err != nil {
        return nil, err
    }

    var recipe Recipe
    if err := yaml.Unmarshal(data, &recipe); err != nil {
        return nil, fmt.Errorf("invalid recipe YAML: %w", err)
    }

    // Validate required fields are present
    if recipe.Name == "" || recipe.Provider == "" {
        return nil, errors.New("recipe missing required fields")
    }

    // Validate steps
    for _, step := range recipe.Steps {
        if step.Name == "" {
            return nil, errors.New("step missing name")
        }
        if step.ExecutionMode != "service_dir" && step.ExecutionMode != "root" {
            return nil, fmt.Errorf("invalid execution mode: %s", step.ExecutionMode)
        }
    }

    return &recipe, nil
}
Invalid recipes fail immediately, not during deployment.
Service validation happens before execution:
$ pilum check
✓ Recipe found: gcp-cloud-run
✓ Build config valid
✗ Error: field 'region' is required but missing
Fix your service.yaml and try again.
Dry-run mode lets you preview without executing:
$ pilum deploy --dry-run --tag=v1.0.0
This shows exactly what commands would run, with variable substitution applied, without actually running them.
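Internally, dry-run can be a single early return in the executor. The sketch below assumes a DryRun flag on the runner's options and hypothetical buildCommand/runCommand/formatCommand helpers; only the "print, don't execute" behavior comes from the description above.
func (r *Runner) execute(task stepTask) error {
    cmd := r.buildCommand(task) // resolve the handler and apply variable substitution (assumed helper)
    if r.options.DryRun {
        // Print what would run, execute nothing.
        fmt.Printf("[%s] %s\n  Command: %s\n", task.service.Name, task.step.Name, formatCommand(cmd))
        return nil
    }
    return r.runCommand(cmd, task) // assumed helper: shells out with the step's timeout and working dir
}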
Handler testing uses Go’s standard testing:
func TestBuildHandler(t *testing.T) {
    ctx := StepContext{
        Service: serviceinfo.ServiceInfo{Name: "test-service"},
        Tag:     "v1.0.0",
    }

    result := buildHandler(ctx)

    cmd, ok := result.([]string)
    if !ok {
        t.Fatal("expected []string")
    }
    if cmd[0] != "go" || cmd[1] != "build" {
        t.Errorf("unexpected command: %v", cmd)
    }
}
We test each handler in isolation, then integration test full recipes against staging environments.
Creating Custom Recipes
Want to deploy to a platform Pilum doesn’t support yet? Create a recipe in your fork or publish a PR.
Example: Deploy to Fly.io
Create recipes/fly-io.yaml
name: fly-io
description: Deploy to Fly.io
provider: fly
service: fly_io
required_fields:
  - name: app_name
    description: Fly.io app name
    type: string
  - name: region
    description: Fly.io region
    type: string
    default: "sea"
steps:
  - name: build docker image
    execution_mode: service_dir
    timeout: 300
  - name: deploy to fly
    execution_mode: service_dir
    command: flyctl deploy --app ${app_name} --region ${region}
    timeout: 180
    default_retries: 1
Now create a service that uses it:
name: my-api
provider: fly
app_name: my-api-production
region: sea
build:
  language: go
  version: "1.23"
If the build docker image step matches an existing handler (Pilum has a generic Docker build handler), it just works. If not, you can register a custom handler:
package handlers

import "github.com/SID-Technologies/Pilum/pkg/runner"

func init() {
    runner.DefaultRegistry.Register("deploy to fly", "fly", flyDeployHandler)
}

func flyDeployHandler(ctx runner.StepContext) any {
    return []string{
        "flyctl", "deploy",
        "--app", ctx.Service.Get("app_name"),
        "--region", ctx.Service.Region,
    }
}
Compile your fork:
go build -o pilum ./cmd/pilum
Now Pilum supports Fly.io.
We’d love PRs for new providers. See CONTRIBUTING.md for guidelines.
Why Open Source
Deployment tools have network effects. The more providers supported, the more useful the tool. But I can’t personally add recipes for every platform—I don’t use AWS ECS, Azure Container Apps, or Render.
Open source lets the community extend Pilum to their platforms. If you deploy to Render, you can contribute a recipe. If you use Earthly for builds, you can add a handler. The recipe system is designed for this.
Trust and transparency matter for deployment tools. These tools run in CI/CD, have access to credentials, and can push to production. Closed source deployment tools ask for a lot of trust. Open source lets you:
- Audit the code
- Verify it’s not doing anything sketchy
- Fork it if we make decisions you disagree with
- Contribute fixes when you find bugs
Dogfooding as validation: Pilum deploys itself to Homebrew. The recipes/homebrew.yaml recipe is how we release new versions:
name: homebrew
description: Build and release to Homebrew tap
provider: homebrew
service: package
required_fields:
  - name: name
    description: Binary name and Homebrew formula name
    type: string
  - name: description
    description: Short description for the Homebrew formula
    type: string
  - name: license
    description: SPDX license identifier (e.g., MIT, Apache-2.0, BSL-1.1)
    type: string
  - name: homebrew.project_url
    description: Full repository URL where releases are hosted (e.g., https://github.com/org/project)
    type: string
  - name: homebrew.tap_url
    description: Full repository URL for the Homebrew tap (e.g., https://github.com/org/Homebrew-tap)
    type: string
  - name: homebrew.token_env
    description: Environment variable name containing the auth token (e.g., GH_TOKEN, HOMEBREW_TAP_TOKEN)
    type: string
steps:
  # Step 1: Build binaries for all platforms (darwin/linux, amd64/arm64)
  - name: build binaries
    execution_mode: root
    timeout: 300
    tags:
      - build
  # Step 2: Create tar.gz archives for each binary
  - name: create archives
    execution_mode: root
    timeout: 60
    tags:
      - build
  # Step 3: Generate SHA256 checksums
  - name: generate checksums
    execution_mode: root
    timeout: 30
    tags:
      - build
  # Step 4: Update Homebrew formula with new version and checksums
  - name: update formula
    execution_mode: root
    timeout: 30
    tags:
      - deploy
  # Step 5: Push updated formula to Homebrew tap repository
  - name: push to tap
    execution_mode: root
    timeout: 60
    tags:
      - deploy
If this breaks, we can’t ship. That’s a powerful incentive to keep it working.
Getting Started
Install via Homebrew:
brew tap sid-technologies/pilum
brew install pilum
Initialize in your project:
cd my-project
pilum init
This creates a sample service.yaml:
name: my-service
provider: gcp # or aws, homebrew, etc.
project: my-gcp-project
region: us-central1
build:
  language: go
  version: "1.23"
Validate your configuration:
pilum check
Deploy:
pilum deploy --tag=v1.0.0
Deploy specific services:
pilum deploy --tag=v1.0.0 --services=api,worker
Preview without executing:
pilum deploy --dry-run --tag=v1.0.0
Real-World Usage at SID
At SID, we use Pilum to deploy 19 services:
$ pilum deploy --tag=v2.3.0
Step 1/4: build binary
[authentication] ✓ (2.1s)
[billing] ✓ (2.3s)
[calendar] ✓ (1.9s)
[kanban] ✓ (2.2s)
[notifications] ✓ (2.0s)
... (14 more services)
Step 2/4: build docker image
[authentication] ✓ (43s)
[billing] ✓ (41s)
... (17 more in parallel)
Step 3/4: publish to registry
[authentication] ✓ (11s)
[billing] ✓ (12s)
... (17 more in parallel)
Step 4/4: deploy to cloud run
[authentication] ✓ (18s)
[billing] ✓ (17s)
... (17 more in parallel)
Deployment complete: 19 services deployed in 45s
Before Pilum, this took 30+ minutes (deploying services serially). With parallel execution it finishes in under a minute, as the log above shows.
Our metrics after 3 months of using Pilum:
- Deployment time: 30 min → 45s (-97.5%)
- Failed deployments: 12% → 2% (validation catches issues early)
- Time to add new service: 30 min → 5 min (copy service.yaml template)
Known Limitations
Pilum is young. Here’s what it doesn’t do well yet:
No built-in rollback: If a deployment succeeds but causes issues, you need to manually deploy the previous tag. We’re working on pilum rollback.
Limited to sequential steps: All services must complete step N before any service can start step N+1. For some workflows (independent services), you’d want fully parallel execution. This is a design trade-off for simplicity.
Recipe changes require Pilum updates: If you want to modify a built-in recipe, you need to fork Pilum or wait for a new release. We’re considering a way to override recipes locally.
No secret management: Pilum assumes your cloud CLI is already authenticated. It doesn’t handle secrets, credentials, or environment variable management. Use your existing secret management solution.
Limited observability: No built-in dashboard, no metrics export, no Slack/email notifications on completion. It’s a CLI tool that outputs to stdout. For production monitoring, you’ll need to wrap it.
These aren’t deal-breakers for our use case (small team, fast iteration). They might be for yours. Feedback welcome.
When NOT to Use Pilum
Single service, single platform: If you’re deploying one Go service to Kubernetes, use ko. If you’re deploying one container to Cloud Run, use gcloud run deploy. Pilum’s value is orchestration across multiple services and platforms.
Kubernetes-native workflows: If your entire stack is Kubernetes and you want GitOps, use ArgoCD or Flux. Pilum doesn’t manage Kubernetes manifests or do continuous reconciliation.
Complex build pipelines: If you need conditional builds, matrix builds, or artifact caching beyond Docker layer caching, use Earthly or Bazel. Pilum’s build step is intentionally simple.
You need rollback automation: Pilum doesn’t yet support automatic rollback. If this is critical, you’ll need to wrap Pilum or wait for the feature.
Try It, Break It, Contribute
Pilum is young (v0.2.0) but deployed in production at SID. We’re using it to deploy 19 services across GCP Cloud Run and Homebrew.
If you’re deploying multi-service systems across multiple platforms, give it a try:
brew tap sid-technologies/pilum
brew install pilum
pilum init # Creates sample service.yaml
pilum check # Validates configuration
pilum deploy --dry-run --tag=v1.0.0
If you hit rough edges, we want to know:
- GitHub Issues: github.com/SID-Technologies/Pilum/issues
- Discussions: github.com/SID-Technologies/Pilum/discussions
If you want a provider we don’t support yet:
- Check open provider requests
- Or contribute a recipe (see CONTRIBUTING.md)
The recipe system is designed for extensibility. We’re betting that the right abstraction—declarative recipes, pluggable handlers, parallel execution—can generalize across deployment targets.
If we’re right, Pilum becomes a shared deployment layer for the ecosystem. If we’re wrong, we’ll learn something and iterate.
What’s Next
Current providers: GCP Cloud Run, Homebrew, AWS Lambda (in progress).
On the roadmap:
- AWS ECS
- GitHub Releases integration
- Parallel recipe discovery across monorepos
The code is on GitHub and the landing page is at pilum.dev. Fork it, extend it, tell us what breaks.
Links:
GitHub: github.com/SID-Technologies/Pilum
Landing page: pilum.dev