lifecycle

Technical Architecture

Note: This document describes the architecture of lifecycle, spanning its v1.0-v1.4 Foundation (Death Management) and the v1.5+ Control Plane (Life Management). For a history of architectural choices, see DECISIONS.md.

Table of Contents


I. The Bedrock (v1.0-v1.4 Foundation)

This section defines the architectural pillars that govern the library.

Architecture Note (Facade Pattern): The root lifecycle package acts as a Facade, exposing a curated subset of functionality from pkg/core and pkg/events for 90% of use cases. Deep consumers should import from the core packages directly, while application authors should prefer the root package for ergonomics. Tests in lifecycle_test.go verify this wiring but do not duplicate the exhaustive behavioral tests found in the core packages.

1. Formal Definition (Identity)

Technically, lifecycle is a Signal-Aware Control Plane and Interruptible I/O Supervisor for modern applications (Services, Agents, CLIs).

2. Design Principles (Constraints)

To prevent “Memory Leaks” and “Zombie Processes”, the system imposes explicit constraints:

2.1. Managed Global State

We acknowledge that OS signals are inherently global. Instead of pretending they aren’t, lifecycle manages this global state for you.

2.2. Fail-Closed Hygiene

We adopt a Fail-Closed default for child processes. If the parent process crashes or is killed (SIGKILL), all child processes must die immediately. This is enforced via OS primitives on supported platforms (see Section 6, Process Hygiene).

2.3. Platform Agnosticism (Windows First)

Windows is a first-class citizen.

2.4. Observability by Default

Internal state changes are not black boxes. They are exposed via pkg/metrics, pkg/log, and the optional Observer hook (see Section 14, Observability).

2.5. Main-Driven Shutdown

The lifecycle is bound to the Main Job (lifecycle.Run(fn)). When the main function returns, the application is considered Complete. lifecycle automatically cancels the Global Context, signaling all background tasks (lifecycle.Go, Supervisor) to shut down immediately. This prevents “Orphaned Processes” where a finished CLI tool hangs indefinitely waiting for a metrics reporter.

2.6. Pragmatic Composition over Monoliths

We believe in Simple Primitives, Rich Behaviors. Instead of a monolithic “Exit” function with 20 flags, we provide atomic events (Suspend, Resume, Shutdown, Reload) that can be chained.


II. Core Mechanics (Death Management)

This section details the internal state machines and I/O handling strategies.

3. Signal State Machine

Our SignalContext manages the transition from Graceful to Forced shutdown based on a configurable Force-Exit Threshold.

stateDiagram-v2
    [*] --> Running
    
    Running --> Graceful: SIGTERM (1st) or SIGINT (Count == Threshold)
    note right of Graceful
        Context cancelled.
        App starts cleanup.
    end note

    Graceful --> ForceExit: Any Signal (Count > Threshold)
    note right of ForceExit
        os.Exit(1) called.
        Immediate termination.
    end note

    Running --> Running: SIGINT (Escalation Mode Threshold >= 2)
    note left of Running
        Count < Threshold:
        ClearLineEvent emitted.
    end note

    ForceExit --> [*]
    Graceful --> [*]: Natural Cleanup Completes

Key Behaviors:

Execution Flow

sequenceDiagram
    participant OS
    participant SignalContext
    participant Hook_B
    participant Hook_A
    participant App

    OS->>SignalContext: SIGTERM
    SignalContext->>App: Cancel Context (ctx.Done closed)
    
    rect rgb(30, 30, 30)
        note right of SignalContext: Async Cleanup (LIFO)
        SignalContext->>Hook_B: Execute()
        Hook_B-->>SignalContext: Return
        SignalContext->>Hook_A: Execute()
        Hook_A-->>SignalContext: Return (or Panic recovered)
    end

4. Context-Aware I/O & Safety

Traditional I/O is binary: it reads or blocks. lifecycle (via procio/termio) introduces Context-Aware I/O to balance Data vs. Safety.

| Strategy | Use Case | Behavior |
| --- | --- | --- |
| Shielded Return | Automation / Logs | Data First. If data arrives alongside a Cancel, return the data. |
| Strict Discard | Interactive Prompts | Safety First. If a Cancel occurs, discard partial input. |
| Regret Window | Critical Ops | Pause. Sleep(ctx) returns early on Cancel. |

sequenceDiagram
    participant App
    participant Reader
    participant OS_Stdin
    participant Context

    note over App: Strategy Selection

    alt Strategy A (Data First)
        App->>Reader: Read()
        OS_Stdin-->>Reader: Returns "Data"
        Context-->>Reader: Returns "Cancelled"
        Reader-->>App: Return "Data", nil
        note right of App: Process Data
    else Strategy B (Error First)
        App->>Reader: ReadInteractive()
        OS_Stdin-->>Reader: Returns "Data"
        Context-->>Reader: Returns "Cancelled"
        Reader-->>App: Return 0, ErrInterrupted
        note right of App: Abort Operation (Strict)
    else Strategy C (Regret Window)
        App->>App: Input Accepted
        App->>lifecycle: Sleep(ctx, 3s)
        Context-->>lifecycle: Cancelled (User Regret)
        lifecycle-->>App: Return ctx.Err()
        note right of App: Abort Execution
    end

5. Managed Concurrency (v1.5)

lifecycle provides primitives to manage goroutines safely, ensuring they respect shutdown signals and provide visibility.

A. Scoped Execution (lifecycle.Go)

The most common pattern. Fire-and-forget but tracked.

lifecycle.Run(func(ctx context.Context) error {
    lifecycle.Go(ctx, func(ctx context.Context) error {
        // Runs in background, but tracked.
        // If it panics, app stays alive.
        return nil
    })
    return nil
})

B. Safe Executor (lifecycle.Do)

Executes a function synchronously with safety guarantees.

C. Structured Group (lifecycle.Group)

For complex parallelism requiring limits or gang-scheduling.

g, ctx := lifecycle.NewGroup(ctx)
g.SetLimit(10)
g.Go(func(ctx context.Context) error {
    // ... bounded parallel work ...
    return nil
})
if err := g.Wait(); err != nil {
    // Handle the first error returned by the group.
}

D. Synchronization with Mutex

To ensure safe access to shared worker state, we use the withLock and withLockResult helpers:

value := withLockResult(p, func() int { return p.myField })
withLock(p, func() { p.myField = 42 })

Attention: Do not use these helpers in methods that already perform locking internally (e.g., ExportState), to avoid deadlocks.

This pattern reduces boilerplate, prevents improper unlocks, and simplifies maintenance.

See the formal decision in ADR05 in DECISIONS.md.

6. Process Hygiene (Powered by procio)

Ensures child processes do not outlive the parent. This logic is delegated to the procio library.

7. Reliability Primitives (v1.4)

To support Durable Execution engines (like Trellis), we provide primitives that shield critical operations.

Critical Sections (lifecycle.DoDetached)

lifecycle.DoDetached(ctx, fn) (formerly Do) allows executing a function that cannot be cancelled by the parent context until it completes. It returns any error produced by the shielded function.

Note: lifecycle.Do(ctx, fn) now represents a “Safe Executor” that respects cancellation but provides panic recovery and observability. DoDetached wraps Do with context.WithoutCancel.

 sequenceDiagram
     participant P as Parent Context
     participant D as lifecycle.DoDetached
     participant F as Function
     
     P->>D: Call DoDetached(ctx, fn)
     D->>F: Run fn(shieldedCtx) -> error
     
     note right of P: User hits Ctrl+C
     P--xP: Cancelled!
     
     note over D: DoDetached detects cancellation<br/>but WAITS for fn
     
     F->>F: Complete Critical Work
     F-->>D: Return error
     
     D-->>P: Return error (or Canceled if shielded ctx ignored)

III. The Supervisor Pattern (The Bridge)

(Introduced in v1.3) The Supervisor manages a set of Workers, forming a Supervision Tree.

8. Worker Protocol

Uniform interface for Process, Container, and Goroutine management.

sequenceDiagram
    participant Manager
    participant Worker
    
    Manager->>Worker: Start(ctx)
    activate Worker
    
    rect rgb(30, 30, 30)
        note right of Worker: Work happens...
    end

    alt Graceful Stop
        Manager->>Worker: Stop(ctx)
        Worker-->>Manager: Returns nil
    else Crash
        Worker->>Worker: Closes Wait() channel (w/ error)
    end
    deactivate Worker

8.1. Protected Resource Cleanup Pattern (STOP / WAIT / CLOSE)

Asynchronous callbacks that send on channels can race with shutdown. To avoid panics and leaked goroutines, adopt a three-stage cleanup protocol:

  1. STOP: Reject new work (e.g., set closed = true).
  2. WAIT: Track in-flight callbacks via sync.WaitGroup and wait for completion.
  3. CLOSE: Close channels or release resources only after callbacks are drained.

Use BlockWithTimeout to avoid indefinite waits during shutdown.

type debouncer struct {
    mu     sync.Mutex     // guards closed (callbacks must check it under the lock)
    closed bool
    wg     sync.WaitGroup // tracks in-flight callbacks
    out    chan Event
}

func (d *debouncer) stopAndWait(timeout time.Duration) {
    // STOP: reject new work.
    d.mu.Lock()
    d.closed = true
    d.mu.Unlock()

    // WAIT: drain in-flight callbacks, bounded by timeout.
    done := make(chan struct{})
    go func() {
        d.wg.Wait()
        close(done)
    }()
    _ = lifecycle.BlockWithTimeout(done, timeout)

    // CLOSE: only after callbacks are drained.
    close(d.out)
}

8.2. Synchronous Worker Shutdown (StopAndWait)

By default, an OS-level worker's Stop method sends a signal and waits using the BaseWorker timeout logic, while a FuncWorker stops and returns immediately upon context cancellation. However, Stop() may return before the child process's stdout and stderr buffers have fully flushed, or before a background goroutine has fully exited.

To address these race conditions in heavily coordinated systems (e.g., executing tools strictly in sequence), the library provides a universal utility: lifecycle.StopAndWait(ctx, worker).

It calls worker.Stop(ctx) internally, then blocks until <-worker.Wait() resolves, ensuring all background I/O and detached goroutines have finished before control returns to the caller.

9. Supervision Tree

Recursive Introspection

Supervisors can manage other supervisors, forming deep trees. The State() method supports recursive inspection by propagating Children fields:

rootSup := supervisor.New("root", supervisor.StrategyOneForOne,
    supervisor.Spec{
        Name: "child-sup",
        Factory: func() (worker.Worker, error) {
            return childSup, nil  // Nested supervisor
        },
    },
)

state := rootSup.State()
// state.Children[0].Children = childSup's children ✅

This enables full topology visualization in introspection diagrams, showing the complete supervision tree regardless of nesting depth.

Worker Identity Shield (Reliability)

To prevent race conditions during rapid failures and restarts (e.g., OneForAll strategy), the supervisor implements a Worker Identity Shield.

10. Handover Protocol

Allows “Durable Execution” across restarts. The Supervisor injects environment variables into the restarted worker:

sequenceDiagram
    participant Sup as Supervisor
    participant W as Worker (Instance 1)
    participant W2 as Worker (Instance 2)
    
    Sup->>W: Start (Injected: RESUME_ID=ABC, PREV_EXIT=0)
    W-->>Sup: Crash!
    
    note over Sup: Strategy OneForOne
    
    Sup->>W2: Start (Injected: RESUME_ID=ABC, PREV_EXIT=-1)
    note right of W2: Worker resumes work for session 'ABC'

IV. The Control Plane (v1.5+)

(Introduced in v1.5) The Control Plane generalized the “Signal” concept into generic “Events”.

11. Event Router (Source -> Handler)

The Router is the central nervous system of the Control Plane, inspired by net/http.ServeMux. It routes generalized Events to specialized Handlers.

Note (Facade): The router and handlers are exposed via the top-level lifecycle package for ease of use (e.g., lifecycle.NewRouter()).

11.1. Mux-Style Pattern Matching

Routes are defined using string patterns:

router.HandleFunc("signal.*", func(ctx context.Context, e Event) error {
    log.Println("Received signal:", e)
    return nil
})

Pattern Syntax & Performance: For detailed pattern syntax, performance benchmarks (scaling with route count), and examples, see LIMITATIONS.md - Router Pattern Matching and pkg/events/router_benchmark_test.go.

11.2. Standard Events (Control Plane)

The library provides predefined events for common lifecycle transitions:

| Event | Topic | Trigger | Typical Action |
| --- | --- | --- | --- |
| SuspendEvent | lifecycle/suspend | Escalation logic / API | Pause workers, persist state. |
| ResumeEvent | lifecycle/resume | Escalation logic / API | Resume workers from state. |
| ShutdownEvent | lifecycle/shutdown | Input: exit, quit | Cancel SignalContext. |
| TerminateEvent | lifecycle/terminate | Input: x, terminate | Suspend (Save) + Shutdown. |
| ClearLineEvent | lifecycle/clear-line | Ctrl+C Escalation Mode | Clear CLI prompt, re-print >. |
| UnknownCommandEvent | input/unknown | Input: ?, unknown | Print generic help message. |

11.3. Middleware Chains

Middleware wraps handlers to provide cross-cutting concerns (logging, recovery, tracing).

router.Use(RecoveryMiddleware)
router.Use(LoggingMiddleware)

11.4. Idempotent Handlers (Once)

Events might be triggered multiple times (e.g., a “Quit” signal followed by a manual “Exit” command). To prevent side-effects like double-closing channels, lifecycle provides the control.Once(handler) middleware.

This utility ensures the wrapped handler’s logic is executed exactly once, providing a standard safety mechanism for shutdown and cleanup operations.

// Protected shutdown logic
quitHandler := control.Once(control.HandlerFunc(func(ctx context.Context, _ control.Event) error {
    close(quitCh) // Safe against multiple calls
    return nil
}))

11.5. Introspection

The Router exposes registered routes and its own status via the Introspectable interface.

type Introspectable interface {
    State() any
}

Calls to State() return a snapshot of the component’s internal state (topology, metrics, flags) for visualization tools.

state := router.State().(RouterState)
// {Routes: [...], Middlewares: 2, Running: true}

11.6. Suspend & Resume (Durable Execution)

To support Durable Execution systems, lifecycle introduces SuspendEvent and ResumeEvent managed by handlers.SuspendHandler.

stateDiagram-v2
    [*] --> Running
    Running --> Suspended: SuspendEvent
    Suspended --> Running: ResumeEvent
    Running --> Graceful: SIGTERM
    Suspended --> Graceful: SIGTERM

11.6.1. Sequential Execution & Ordering (FIFO)

The SuspendHandler (and most control.Router logic) executes hooks sequentially in the order they were registered. This has critical implications for UI feedback:

// Correct Order:
suspendHandler.Manage(supervisor)              // 1. Heavy lifting (blocking)
suspendHandler.OnSuspend(func(...) {           // 2. UI feedback (runs after #1)
    fmt.Println("🛑 SYSTEM SUSPENDED")
})

11.7. Execution Flow

sequenceDiagram
    participant S as Source (OS/HTTP)
    participant R as Router
    participant M as Middleware
    participant H as Handler

    S->>R: Emit(Event)
    R->>R: Match(Event.Topic)
    R->>M: Dispatch(Event)
    M->>H: Handle(Event)
    H-->>M: Return error?
    M-->>R: Complete

11.8. Interactive Router Preset

To reduce boilerplate for CLI applications, lifecycle provides a pre-configured router helper.

// wires up:
// - OS Signals (Interrupt/Term) -> Escalator (Interrupt first, then Quit)
// - Input (Stdin) -> Router (reads lines as commands)
// - Commands: "suspend", "resume" -> SuspendHandler
// - Command: "quit", "q" -> shutdownFunc
router := lifecycle.NewInteractiveRouter(
    lifecycle.WithSuspendOnInterrupt(suspendHandler),
    lifecycle.WithShutdown(func() { ... }),
)

This helper ensures standard behavior (“q” to quit, “Ctrl+C” to suspend first) without manual wiring.

11.9. Source Helper Pattern (BaseSource)

To reduce boilerplate across source implementations, lifecycle provides BaseSource — an embeddable helper following the same pattern as BaseWorker.

Problem: seven source types repeated an identical Events() method implementation.

Solution: Embedding pattern with auto-exposed methods.

Before (per source):

type MySource struct {
    events chan control.Event  // Repeated
}

func NewMySource() *MySource {
    return &MySource{
        events: make(chan control.Event, 10),  // Repeated
    }
}

func (s *MySource) Events() <-chan control.Event {  // Repeated
    return s.events
}

func (s *MySource) Start(ctx context.Context) error {
    s.events <- event  // Direct channel access
    return nil
}

After (with BaseSource):

type MySource struct {
    control.BaseSource  // Embedding!
}

func NewMySource() *MySource {
    return &MySource{
        BaseSource: control.NewBaseSource(10),  // Explicit buffer
    }
}

// Events() FREE via embedding!

func (s *MySource) Start(ctx context.Context) error {
    s.Emit(event)  // Clean helper
    return nil
}

Benefits:

API:

func NewBaseSource(bufferSize int) BaseSource
func (b *BaseSource) Events() <-chan Event  // Auto via embedding
func (b *BaseSource) Emit(e Event)           // Helper
func (b *BaseSource) Close()                 // Cleanup

Usage: FileWatchSource, WebhookSource, TickerSource, InputSource, HealthCheckSource, ChannelSource, OSSignalSource.

11.10. Event Conditioning & Throttling (Debounce)

High-frequency event sources (like recursive filesystem watchers) can overwhelm the system. The Control Plane provides events.DebounceHandler to buffer bursts and emit a single, stable event after a quiet window (trailing edge).

11.11. Channel Subscriptions (Pub/Sub)

While the default Router uses a callback-based Handler interface, some Go applications prefer idiomatic select loops or range iterations over channels.

The events.Notify(ch) bridge converts a standard Go channel into a Handler. It performs non-blocking sends, dropping events cleanly (ErrNotHandled) if the consumer’s channel buffer is full, preventing the Control Plane from deadlocking due to a slow reader.

// Allows idiomatic integration with other select loops
ch := make(chan events.Event, 100)
router.Handle("file/*", events.Notify(ch))

12. Managed Concurrency (lifecycle.Go)

To adhere to Zero Config while keeping concurrency safe, we use Context Propagation.

// 1. Run injects a TaskTracker into the context
runtime.Run(func(ctx context.Context) error {
    // 2. Go() uses the tracker from the context
    runtime.Go(ctx, func(ctx context.Context) error {
        // ... safe background work ...
        return nil
    })
    return nil
})

Features:


V. Ecosystem & Operations

13. Introspection & Visualization

lifecycle adopts the Introspection Pattern: components expose State() methods returning immutable DTOs, which are rendered into Mermaid diagrams via the github.com/aretw0/introspection library.

Architecture: Separation of Concerns

Visualization is delegated to the external introspection library, following the same Primitive Promotion strategy used for procio (see ADR-0010 and ADR-0011).

This separation ensures that:

  1. Diagram logic is DRY: Rendering logic is not duplicated across signal, worker, and supervisor packages.
  2. Visualization is reusable: Other projects (e.g., trellis, arbour) can use introspection for their own topologies.
  3. Maintenance is centralized: Visual improvements or Mermaid syntax changes happen in one place.

Diagram Types

Unified System Diagram

The lifecycle.SystemDiagram(sig, work) function synthesizes the Control Plane (Signal Context) and Data Plane (Worker Tree) into a single Mermaid diagram:

diagram := lifecycle.SystemDiagram(ctx.State(), supervisor.State())

This delegates to introspection.ComponentDiagram, which applies the configuration from LifecycleDiagramConfig().

Status Palette (CSS Classes)

Stylers apply CSS classes to represent component states.

These classes are defined in the domain packages (pkg/core/signal, pkg/core/worker) and consumed by introspection via the NodeStyler and PrimaryStyler hooks.

For implementation details, see docs/ecosystem/introspection.md.

14. Observability

The library is instrumented via pkg/metrics, pkg/log, and the optional Observer hook.

Panic Reporting (Observer Hook)

When a background task panics (lifecycle.Go), the runtime invokes the configured Observer hook.

Stack capture is controlled by WithStackCapture(bool).

Configuration and ObserverBridge examples live in docs/CONFIGURATION.md.

15. Known Limitations

For a comprehensive list of platform-specific constraints, API stability status, performance unknowns, and compatibility matrices, see LIMITATIONS.md.

Key Highlights:

VI. Quality & Reliability

16. Honest Coverage Philosophy

lifecycle adopts an “Honest Coverage” baseline. Instead of pursuing an arbitrary 100% or even 80% statement coverage across every line of code, we prioritize the verification of Behavioral Logic and Critical Path Resilience.

We distinguish between two types of code: behavioral logic on critical paths, which must be verified exhaustively, and defensive plumbing whose error states are rarely reachable in practice.

17. Coverage Rigidity vs. Reality

In certain packages, “100% coverage” often indicates Test Theater—tests that exercise no-op paths or force unreachable error states (like mock syscall failures) just to satisfy a metric.

We consider a package “Satisfactory” even with lower metrics when the missing coverage falls into categories such as no-op paths or unreachable error states.

By setting an Honest Baseline, we ensure that our engineering efforts are spent on validating the reliability of the system, not on maintaining the theater of perfect metrics.