Note: This document describes the architecture of `lifecycle`, spanning its v1.0-v1.4 Foundation (Death Management) and the v1.5+ Control Plane (Life Management). For a history of architectural choices, see DECISIONS.md.
This section defines the architectural pillars that govern the library.
Architecture Note (Facade Pattern): The root `lifecycle` package acts as a Facade, exposing a curated subset of functionality from `pkg/core` and `pkg/events` for 90% of use cases. Deep consumers should import from the core packages directly, while application authors should prefer the root package for ergonomics. Tests in `lifecycle_test.go` verify this wiring but do not duplicate the exhaustive behavioral tests found in the core packages.
Technically, lifecycle is a Signal-Aware Control Plane and Interruptible I/O Supervisor for modern applications (Services, Agents, CLIs).
It distinguishes "User Intent" (`SIGINT`) from "System Demands" (`SIGTERM`), enabling intelligent shutdown policies (e.g., "Press Ctrl+C again to force quit"). Blocking I/O calls (e.g., `read`) are wrapped so they can be abandoned instantly via Context cancellation, preventing goroutine leaks. To prevent "Memory Leaks" and "Zombie Processes", the system imposes explicit constraints:
We acknowledge that OS signals are inherently global. Instead of pretending they aren’t, lifecycle manages this global state for you.
Like `net/http`, we provide a default multiplexer for ease of use, built around Context propagation and Handler interfaces. We adopt a Fail-Closed default for child processes.
If the parent process crashes or is killed (SIGKILL), all child processes must die immediately. This is enforced via OS primitives on supported platforms:
- Linux: `SysProcAttr.Pdeathsig`
- Windows: Job Objects (`JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE`)

Windows is a first-class citizen. The terminal layer attaches to `CONIN$` to ensure Ctrl+C works reliably in interactive prompts.

Internal state changes are not black boxes. They are exposed via `State()` methods that allow the application to visualize its own topology.

The lifecycle is bound to the Main Job (`lifecycle.Run(fn)`). When the main function returns, the application is considered Complete. `lifecycle` automatically cancels the Global Context, signaling all background tasks (`lifecycle.Go`, Supervisor) to shut down immediately. This prevents "Orphaned Processes" where a finished CLI tool hangs indefinitely waiting for a metrics reporter.
We believe in Simple Primitives, Rich Behaviors. Instead of a monolithic “Exit” function with 20 flags, we provide atomic events (Suspend, Resume, Shutdown, Reload) that can be chained.
A terminate command (`x` or `terminate`) is simply a sequence: a `SuspendEvent` (to ensure state is saved) followed by a `ShutdownEvent`.

`context.Canceled` is not an "Error" to be warned about. It is the sign of a healthy, responding system fulfilling its contract.

This section details the internal state machines and I/O handling strategies.
Our SignalContext manages the transition from Graceful to Forced shutdown based on a configurable Force-Exit Threshold.
```mermaid
stateDiagram-v2
    [*] --> Running
    Running --> Graceful: SIGTERM (1st) or SIGINT (Count == Threshold)
    note right of Graceful
        Context cancelled.
        App starts cleanup.
    end note
    Graceful --> ForceExit: Any Signal (Count > Threshold)
    note right of ForceExit
        os.Exit(1) called.
        Immediate termination.
    end note
    Running --> Running: SIGINT (Escalation Mode Threshold >= 2)
    note left of Running
        Count < Threshold:
        ClearLineEvent emitted.
    end note
    ForceExit --> [*]
    Graceful --> [*]: Natural Cleanup Completes
```
Key Behaviors:
- Default: the first `SIGINT` (Ctrl+C) or `SIGTERM` cancels the context; the second signal triggers `os.Exit(1)`.
- Escalation Mode: `SIGINT` is captured and emitted as an event (`InterruptEvent`) without cancelling the context; only the N-th signal triggers `os.Exit(1)`. `SIGTERM` always cancels on the first signal.
- When `WithCancelOnInterrupt(false)` is set, the runtime implicitly increments the threshold by 2. This preserves the "Distance Invariant" (kill distance relative to the last software action) and prevents races during interactive shutdowns.
- `OnShutdown` hooks run concurrently or sequentially (LIFO) depending on configuration, but always after context cancellation.
- `ctx.Reason()` differentiates whether closure was manual (`Stop()`), signal-based (Interrupt), or time-based (Timeout).
- If cleanup exceeds `WithShutdownTimeout` (default 2s), the runtime automatically dumps all goroutine stacks to stderr to help diagnose hangs.

```mermaid
sequenceDiagram
    participant OS
    participant SignalContext
    participant Hook_B
    participant Hook_A
    participant App
    OS->>SignalContext: SIGTERM
    SignalContext->>App: Cancel Context (ctx.Done closed)
    rect rgb(30, 30, 30)
        note right of SignalContext: Async Cleanup (LIFO)
        SignalContext->>Hook_B: Execute()
        Hook_B-->>SignalContext: Return
        SignalContext->>Hook_A: Execute()
        Hook_A-->>SignalContext: Return (or Panic recovered)
    end
```
Traditional I/O is binary: it reads or blocks. lifecycle (via procio/termio) introduces Context-Aware I/O to balance Data vs. Safety.
| Strategy | Use Case | Behavior |
|---|---|---|
| Shielded Return | Automation / Logs | Data First. If data arrives with Cancel, return Data. |
| Strict Discard | Interactive Prompts | Safety First. If Cancel occurs, discard partial input. |
| Regret Window | Critical Ops | Pause. `Sleep(ctx)` aborts the wait on Cancel. |
```mermaid
sequenceDiagram
    participant App
    participant Reader
    participant OS_Stdin
    participant Context
    note over App: Strategy Selection
    alt Strategy A (Data First)
        App->>Reader: Read()
        OS_Stdin-->>Reader: Returns "Data"
        Context-->>Reader: Returns "Cancelled"
        Reader-->>App: Return "Data", nil
        note right of App: Process Data
    else Strategy B (Error First)
        App->>Reader: ReadInteractive()
        OS_Stdin-->>Reader: Returns "Data"
        Context-->>Reader: Returns "Cancelled"
        Reader-->>App: Return 0, ErrInterrupted
        note right of App: Abort Operation (Strict)
    else Strategy C (Regret Window)
        App->>App: Input Accepted
        App->>lifecycle: Sleep(ctx, 3s)
        Context-->>lifecycle: Cancelled (User Regret)
        lifecycle-->>App: Return ctx.Err()
        note right of App: Abort Execution
    end
```
lifecycle provides primitives to manage goroutines safely, ensuring they respect shutdown signals and provide visibility.
(`lifecycle.Go`) The most common pattern: fire-and-forget, but tracked. `lifecycle.Run` automatically waits for these tasks.

```go
lifecycle.Run(func(ctx context.Context) error {
	lifecycle.Go(ctx, func(ctx context.Context) error {
		// Runs in background, but tracked.
		// If it panics, the app stays alive.
		return nil
	})
	return nil
})
```
(`lifecycle.Do`) Executes a function synchronously with the same safety guarantees as `Go` and `Group`.

(`lifecycle.Group`) For complex parallelism requiring limits or gang-scheduling. Similar to `errgroup.Group`, with `SetLimit(n)`, panic recovery, and metric tracking.

```go
g, ctx := lifecycle.NewGroup(ctx)
g.SetLimit(10)
g.Go(func(ctx context.Context) error { ... })
g.Wait()
```
To ensure safe access to shared worker state, we use the `withLock` and `withLockResult` helpers:

```go
value := withLockResult(p, func() int { return p.myField })
withLock(p, func() { p.myField = 42 })
```

Attention: Do not use these helpers in methods that already perform locking internally (e.g., `ExportState`), to avoid deadlocks.

This pattern reduces boilerplate, prevents improper unlocks, and simplifies maintenance.

See the formal decision in ADR05 in DECISIONS.md.
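One plausible shape for these helpers, sketched here with generics and `sync.Locker` (the real signatures may differ; in the usage above, `p` presumably satisfies `sync.Locker` via an embedded mutex):

```go
package main

import (
	"fmt"
	"sync"
)

// withLock runs fn while holding the lock; defer guarantees the
// unlock even if fn panics, preventing improper-unlock bugs.
func withLock(mu sync.Locker, fn func()) {
	mu.Lock()
	defer mu.Unlock()
	fn()
}

// withLockResult is the value-returning variant.
func withLockResult[T any](mu sync.Locker, fn func() T) T {
	mu.Lock()
	defer mu.Unlock()
	return fn()
}

type pool struct {
	sync.Mutex
	myField int
}

func main() {
	p := &pool{}
	withLock(p, func() { p.myField = 42 })
	v := withLockResult(p, func() int { return p.myField })
	fmt.Println(v) // 42
}
```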
(`procio`) Ensures child processes do not outlive the parent. This logic is delegated to the `procio` library.

- Linux: `SysProcAttr.Pdeathsig` signals the child when the parent thread dies.
- Windows: Job Objects (`JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE`) ensure the OS terminates the child tree when the parent handle is closed.
- Other platforms: falls back to plain `exec.Cmd` (OS limitations prevent strict guarantees).

To support Durable Execution engines (like Trellis), we provide primitives that shield critical operations.

(`lifecycle.DoDetached`) `lifecycle.DoDetached(ctx, fn)` (formerly `Do`) allows executing a function that cannot be cancelled by the parent context until it completes. It returns any error produced by the shielded function.

Note: `lifecycle.Do(ctx, fn)` now represents a "Safe Executor" that respects cancellation but provides panic recovery and observability. `DoDetached` wraps `Do` with `context.WithoutCancel`.
```mermaid
sequenceDiagram
    participant P as Parent Context
    participant D as lifecycle.DoDetached
    participant F as Function
    P->>D: Call DoDetached(ctx, fn)
    D->>F: Run fn(shieldedCtx) -> error
    note right of P: User hits Ctrl+C
    P--xP: Cancelled!
    note over D: DoDetached detects cancellation<br/>but WAITS for fn
    F->>F: Complete Critical Work
    F-->>D: Return error
    D-->>P: Return error (or Canceled if shielded ctx ignored)
```
(Introduced in v1.3) The Supervisor manages a set of Workers, forming a Supervision Tree.
Uniform interface for Process, Container, and Goroutine management.
```mermaid
sequenceDiagram
    participant Manager
    participant Worker
    Manager->>Worker: Start(ctx)
    activate Worker
    rect rgb(30, 30, 30)
        note right of Worker: Work happens...
    end
    alt Graceful Stop
        Manager->>Worker: Stop(ctx)
        Worker-->>Manager: Returns nil
    else Crash
        Worker->>Worker: Closes Wait() channel (w/ error)
    end
    deactivate Worker
```
Asynchronous callbacks that send on channels can race with shutdown. To avoid panics and leaked goroutines, adopt a three-stage cleanup protocol:
1. Mark the component closed under lock (`closed = true`).
2. Track sender goroutines with a `sync.WaitGroup` and wait for completion.
3. Only then close the output channel.

Use `BlockWithTimeout` to avoid indefinite waits during shutdown.
```go
type debouncer struct {
	closed bool
	wg     sync.WaitGroup
	out    chan Event
}

func (d *debouncer) stopAndWait(timeout time.Duration) {
	d.closed = true
	done := make(chan struct{})
	go func() {
		d.wg.Wait()
		close(done)
	}()
	_ = lifecycle.BlockWithTimeout(done, timeout)
	close(d.out)
}
```
By default, an OS-level worker’s Stop method sends a signal and waits using the BaseWorker timeout logic. A FuncWorker stops and returns immediately upon context cancellation. However, Stop() calls may return before the child process’s stdout and stderr buffers have completely flushed, or a background goroutine has fully exited.
To address race conditions in heavily coordinated systems (like executing tools strictly sequentially), the library provides a universal utility: lifecycle.StopAndWait(ctx, worker).
It internally calls worker.Stop(ctx) but firmly blocks return until <-worker.Wait() completely resolves, ensuring all background I/O or detached routines are cleanly closed before yielding control back to the caller.
Restart policies are `Always`, `OnFailure`, or `Never`, with restart intensity capped (`MaxRestarts` within `MaxDuration`). Exposed state includes:

- `restarts`: total restart count for the child.
- `circuit_breaker`: set to `triggered` if the max-restarts threshold is exceeded.

Supervisors can manage other supervisors, forming deep trees. The `State()` method supports recursive inspection by propagating `Children` fields:
```go
rootSup := supervisor.New("root", supervisor.StrategyOneForOne,
	supervisor.Spec{
		Name: "child-sup",
		Factory: func() (worker.Worker, error) {
			return childSup, nil // Nested supervisor
		},
	},
)

state := rootSup.State()
// state.Children[0].Children = childSup's children ✅
```
This enables full topology visualization in introspection diagrams, showing the complete supervision tree regardless of nesting depth.
To prevent race conditions during rapid failures and restarts (e.g., OneForAll strategy), the supervisor implements a Worker Identity Shield.
Each `childExit` event carries a reference to the specific `worker.Worker` instance that triggered it. The supervisor's monitor loop verifies this identity against the currently active child before taking action. Stale events are logged and ignored.

Session Resumption allows "Durable Execution" across restarts. The Supervisor injects environment variables into the restarted worker:

- `LIFECYCLE_RESUME_ID`: stable UUID for the worker session.
- `LIFECYCLE_PREV_EXIT`: exit code of the previous run.

```mermaid
sequenceDiagram
    participant Sup as Supervisor
    participant W as Worker (Instance 1)
    participant W2 as Worker (Instance 2)
    Sup->>W: Start (Injected: RESUME_ID=ABC, PREV_EXIT=0)
    W-->>Sup: Crash!
    note over Sup: Strategy OneForOne
    Sup->>W2: Start (Injected: RESUME_ID=ABC, PREV_EXIT=-1)
    note right of W2: Worker resumes work for session 'ABC'
```
(Introduced in v1.5) The Control Plane generalized the “Signal” concept into generic “Events”.
The Router is the central nervous system of the Control Plane, inspired by net/http.ServeMux. It routes generalized Events to specialized Handlers.
Note (Facade): The router and handlers are exposed via the top-level `lifecycle` package for ease of use (e.g., `lifecycle.NewRouter()`).
Routes are defined using string patterns:
- Exact: `"webhook/reload"` (O(1) map lookup)
- Wildcard: `"signal.*"` (O(n) linear search using `path.Match`)

```go
router.HandleFunc("signal.*", func(ctx context.Context, e Event) error {
	log.Println("Received signal:", e)
	return nil
})
```
Pattern Syntax & Performance: For detailed pattern syntax, performance benchmarks (scaling with route count), and examples, see LIMITATIONS.md - Router Pattern Matching and `pkg/events/router_benchmark_test.go`.
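The two-tier lookup described above (exact map hit first, then linear wildcard scan) can be sketched with the stdlib `path.Match`; `match` is an illustrative helper, not the router's internal code:

```go
package main

import (
	"fmt"
	"path"
)

// match tries an O(1) exact lookup, then falls back to an O(n) scan
// over wildcard patterns using path.Match.
func match(exact map[string]string, wildcards []string, topic string) (string, bool) {
	if h, ok := exact[topic]; ok {
		return h, true
	}
	for _, pat := range wildcards {
		if ok, _ := path.Match(pat, topic); ok {
			return pat, true
		}
	}
	return "", false
}

func main() {
	exact := map[string]string{"webhook/reload": "reloadHandler"}
	wildcards := []string{"signal.*"}

	fmt.Println(match(exact, wildcards, "webhook/reload")) // exact hit
	fmt.Println(match(exact, wildcards, "signal.SIGHUP"))  // wildcard hit
	fmt.Println(match(exact, wildcards, "other/topic"))    // no match
}
```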
The library provides predefined events for common lifecycle transitions:
| Event | Topic | Trigger | Typical Action |
|---|---|---|---|
| `SuspendEvent` | `lifecycle/suspend` | Escalation logic / API | Pause workers, persist state. |
| `ResumeEvent` | `lifecycle/resume` | Escalation logic / API | Resume workers from state. |
| `ShutdownEvent` | `lifecycle/shutdown` | Input: `exit`, `quit` | Cancel SignalContext. |
| `TerminateEvent` | `lifecycle/terminate` | Input: `x`, `terminate` | Suspend (Save) + Shutdown. |
| `ClearLineEvent` | `lifecycle/clear-line` | Ctrl+C Escalation Mode | Clear CLI prompt, re-print `>`. |
| `UnknownCommandEvent` | `input/unknown` | Input: `?`, unknown | Print generic help message. |
Middleware wraps handlers to provide cross-cutting concerns (logging, recovery, tracing).
```go
router.Use(RecoveryMiddleware)
router.Use(LoggingMiddleware)
```
Events might be triggered multiple times (e.g., a “Quit” signal followed by a manual “Exit” command). To prevent side-effects like double-closing channels, lifecycle provides the control.Once(handler) middleware.
This utility ensures the wrapped handler’s logic is executed exactly once, providing a standard safety mechanism for shutdown and cleanup operations.
```go
// Protected shutdown logic
quitHandler := control.Once(control.HandlerFunc(func(ctx context.Context, _ control.Event) error {
	close(quitCh) // Safe against multiple calls
	return nil
}))
```
The Router exposes registered routes and its own status via the Introspectable interface.
```go
type Introspectable interface {
	State() any
}
```
Calls to State() return a snapshot of the component’s internal state (topology, metrics, flags) for visualization tools.
```go
state := router.State().(RouterState)
// {Routes: [...], Middlewares: 2, Running: true}
```
To support Durable Execution systems, `lifecycle` introduces `SuspendEvent` and `ResumeEvent`, managed by the `SuspendHandler`.
```mermaid
stateDiagram-v2
    [*] --> Running
    Running --> Suspended: SuspendEvent
    Suspended --> Running: ResumeEvent
    Running --> Graceful: SIGTERM
    Suspended --> Graceful: SIGTERM
```
The `SuspendHandler` uses an internal `transitioning` flag to ensure that while hooks are running, duplicate events (e.g., rapid-fire suspend signals) are ignored. This prevents race conditions and ensures hook execution is atomic. `Suspend` while already suspended (or `Resume` while running) is ignored safely. The `SuspendGate` primitive is context-aware, ensuring buffered workers can abort instantly if the application shuts down while they are paused.

The `SuspendHandler` (and most `control.Router` logic) executes hooks sequentially in the order they were registered. This has critical implications for UI feedback: if a UI hook runs before a blocking hook (`Blocker`), the UI will announce success immediately, while the system is still technically in transition.

```go
// Correct order:
suspendHandler.Manage(supervisor)    // 1. Heavy lifting (blocking)
suspendHandler.OnSuspend(func(...) { // 2. UI feedback (runs after #1)
	fmt.Println("🛑 SYSTEM SUSPENDED")
})
```
```mermaid
sequenceDiagram
    participant S as Source (OS/HTTP)
    participant R as Router
    participant M as Middleware
    participant H as Handler
    S->>R: Emit(Event)
    R->>R: Match(Event.Topic)
    R->>M: Dispatch(Event)
    M->>H: Handle(Event)
    H-->>M: Return error?
    M-->>R: Complete
```
To reduce boilerplate for CLI applications, lifecycle provides a pre-configured router helper.
```go
// NewInteractiveRouter wires up:
// - OS Signals (Interrupt/Term)   -> Escalator (Interrupt first, then Quit)
// - Input (Stdin)                 -> Router (reads lines as commands)
// - Commands "suspend", "resume"  -> SuspendHandler
// - Commands "quit", "q"          -> shutdownFunc
router := lifecycle.NewInteractiveRouter(
	lifecycle.WithSuspendOnInterrupt(suspendHandler),
	lifecycle.WithShutdown(func() { ... }),
)
```
This helper ensures standard behavior (“q” to quit, “Ctrl+C” to suspend first) without manual wiring.
To reduce boilerplate across source implementations, lifecycle provides BaseSource — an embeddable helper following the same pattern as BaseWorker.
Problem: seven source types repeated an identical `Events()` method implementation.
Solution: Embedding pattern with auto-exposed methods.
Before (per source):
```go
type MySource struct {
	events chan control.Event // Repeated
}

func NewMySource() *MySource {
	return &MySource{
		events: make(chan control.Event, 10), // Repeated
	}
}

func (s *MySource) Events() <-chan control.Event { // Repeated
	return s.events
}

func (s *MySource) Start(ctx context.Context) error {
	s.events <- event // Direct access
	return nil
}
```
After (with BaseSource):
```go
type MySource struct {
	control.BaseSource // Embedding!
}

func NewMySource() *MySource {
	return &MySource{
		BaseSource: control.NewBaseSource(10), // Explicit buffer
	}
}

// Events() comes free via embedding.
func (s *MySource) Start(ctx context.Context) error {
	s.Emit(event) // Clean helper
	return nil
}
```
Benefits:
API:
```go
func NewBaseSource(bufferSize int) BaseSource
func (b *BaseSource) Events() <-chan Event // Auto via embedding
func (b *BaseSource) Emit(e Event)         // Helper
func (b *BaseSource) Close()               // Cleanup
```
Usage: FileWatchSource, WebhookSource, TickerSource, InputSource, HealthCheckSource, ChannelSource, OSSignalSource.
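The embedding pattern itself is plain Go and can be demonstrated standalone. A self-contained sketch (types renamed to avoid implying the real `control` package API):

```go
package main

import "fmt"

type event string

// baseSource owns the buffered channel and exposes the three methods
// the document lists: Events, Emit, and Close.
type baseSource struct{ events chan event }

func newBaseSource(buf int) baseSource {
	return baseSource{events: make(chan event, buf)}
}

func (b *baseSource) Events() <-chan event { return b.events }
func (b *baseSource) Emit(e event)         { b.events <- e }
func (b *baseSource) Close()               { close(b.events) }

// tickerSource gets Events(), Emit(), and Close() for free via embedding.
type tickerSource struct{ baseSource }

func main() {
	s := &tickerSource{baseSource: newBaseSource(10)}
	s.Emit("tick") // promoted helper, no boilerplate per source type
	s.Close()
	for e := range s.Events() {
		fmt.Println(e) // tick
	}
}
```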
High-frequency event sources (like recursive filesystem watchers) can overwhelm the system. The Control Plane provides events.DebounceHandler to buffer bursts and emit a single, stable event after a quiet window (trailing edge).
- The `WithMaxWait` option guarantees a synchronous payload flush after a specified maximum duration.
- A `mergeFunc` combines arriving events (e.g., accumulating changed file paths) rather than blindly dropping them.

While the default Router uses a callback-based Handler interface, some Go applications prefer idiomatic `select` loops or `range` iterations over channels.
The events.Notify(ch) bridge converts a standard Go channel into a Handler. It performs non-blocking sends, dropping events cleanly (ErrNotHandled) if the consumer’s channel buffer is full, preventing the Control Plane from deadlocking due to a slow reader.
```go
// Allows idiomatic integration with other select loops
ch := make(chan events.Event, 100)
router.Handle("file/*", events.Notify(ch))
```
(`lifecycle.Go`) To adhere to Zero Config while keeping concurrency safe, we use Context Propagation.
```go
// 1. Run injects a TaskTracker into the context
runtime.Run(func(ctx context.Context) error {
	// 2. Go() uses the tracker from the context
	runtime.Go(ctx, func(ctx context.Context) error {
		// ... safe background work ...
		return nil
	})
	return nil
})
```
Features:
- `Go` looks for a tracker in `ctx`. If found, the goroutine is tracked by `lifecycle.Run`.
- If `Run` is not used, `Go` falls back to a global tracker. You can wait for these tasks with `lifecycle.WaitForGlobal()`.
- `Run()` waits for all tracked goroutines to finish before exiting.

`lifecycle` adopts the Introspection Pattern: components expose `State()` methods returning immutable DTOs, which are rendered into Mermaid diagrams via the `github.com/aretw0/introspection` library.
Visualization is delegated to the external introspection library, following the same Primitive Promotion strategy used for procio (see ADR-0010 and ADR-0011).
- `lifecycle` provides domain-specific styling logic (`signal.PrimaryStyler`, `worker.NodeLabeler`) and configuration via `LifecycleDiagramConfig()`.
- `introspection` handles structural rendering (Mermaid syntax, graph traversal, CSS class application).

This separation ensures that:

- Domain logic stays in the `signal`, `worker`, and `supervisor` packages.
- Sibling projects (`trellis`, `arbour`) can use `introspection` for their own topologies.

Signal state machines are rendered via `introspection.StateMachineDiagram` as `stateDiagram-v2`; worker trees via `introspection.TreeDiagram` or `introspection.ComponentDiagram` as `graph TD`. The `lifecycle.SystemDiagram(sig, work)` function synthesizes the Control Plane (Signal Context) and Data Plane (Worker Tree) into a single Mermaid diagram:
```go
diagram := lifecycle.SystemDiagram(ctx.State(), supervisor.State())
```
This delegates to introspection.ComponentDiagram, which applies the configuration from LifecycleDiagramConfig().
The following CSS classes are applied by stylers to represent component states:
These classes are defined in the domain packages (pkg/core/signal, pkg/core/worker) and consumed by introspection via the NodeStyler and PrimaryStyler hooks.
For implementation details, see docs/ecosystem/introspection.md.
The library is instrumented via pkg/metrics, pkg/log, and the optional Observer hook.
- Signals: `IncSignalReceived`
- Processes: `IncProcessStarted`, `IncProcessFailed`
- Hooks: `ObserveHookDuration`
- Terminal: `IncTerminalUpgrade` (Windows `CONIN$` usage)

When a background task panics (`lifecycle.Go`), the runtime invokes `Observer.OnGoroutinePanicked(recovered, stack)`. Stack capture is controlled by `WithStackCapture(bool)`.
Configuration and ObserverBridge examples live in docs/CONFIGURATION.md.
For a comprehensive list of platform-specific constraints, API stability status, performance unknowns, and compatibility matrices, see LIMITATIONS.md.
Key Highlights:
- Overhead is roughly 1µs per `lifecycle.Go()` call; stack capture adds +1-2µs if enabled.
- Coverage gaps: `metrics`, `termio` (external), and flaky FS code; see TESTING.md.

`lifecycle` adopts an "Honest Coverage" baseline. Instead of pursuing an arbitrary 100% or even 80% statement coverage across every line of code, we prioritize the verification of Behavioral Logic and Critical Path Resilience.
We distinguish between two types of code:
In certain packages, “100% coverage” often indicates Test Theater—tests that exercise no-op paths or force unreachable error states (like mock syscall failures) just to satisfy a metric.
We consider a package “Satisfactory” even with lower metrics if the missing coverage falls into these categories:
- Trivial adapters: `NoOpProvider` or `LogProvider`, which exist for interface compatibility and possess no complex logic.

By setting an Honest Baseline, we ensure that our engineering efforts are spent on validating the reliability of the system, not on maintaining the theater of perfect metrics.