ADR-019: Cascade Monitor Pattern

Critical Decision | 2025-06-20 | Accepted | Revised 2025-06-29

Problem Statement

The cascade workflow needs reliable triggering when upstream changes are merged into the fork_upstream branch. However, automatic triggering approaches created significant reliability and usability challenges that undermined the system's effectiveness and team adoption.

Context and Requirements

Automatic Triggering Challenges

GitHub Event Limitations

  • pull_request_target events require workflows to exist on the target branch (fork_upstream)
  • Pure mirror branches don't contain workflow files, breaking event-based triggering
  • Complex conditional logic needed to filter relevant events from noise
  • Poor error handling for edge cases and missed triggers

Team Control Requirements

  • Teams want explicit control over integration timing, not automatic triggering
  • Need ability to batch multiple changes or time integrations appropriately
  • Desire for clear audit trails and progress tracking throughout cascade lifecycle
  • Requirements for reliable error recovery and missed trigger detection

Visibility and Reliability Issues

  • Unreliable triggering due to workflow file availability constraints
  • Poor visibility into cascade progress and state management
  • Complex error handling scattered across multiple workflow conditions
  • No comprehensive tracking of issue lifecycle and resolution status

Architectural Requirements

Human-Centric Control: Explicit human decision points with clear instructions and visible progress tracking.

Reliable Safety Net: Automated detection and recovery of missed triggers without complex event dependencies.

Comprehensive State Management: Complete audit trail from sync detection through production deployment with issue-based coordination.

Decision

Implement a Human-Centric Cascade Pattern with an intelligent monitor-based safety net:

graph TD
    A[Sync PR Merged] --> B[Human Reviews Changes]
    B --> C[Manual Cascade Trigger]
    C --> D[Cascade Starts]
    D --> E[Issue Tracking Updates]
    E --> F{Conflicts?}
    F -->|No| G[Integration Success]
    F -->|Yes| H[Conflict Resolution]
    H --> I[Human Required]
    I --> J[Recovery Signal]
    J --> K[Auto-Retry]

    L[Monitor Schedule] --> M[Check Missed Triggers]
    M --> N{Pending Changes?}
    N -->|Yes| O[Auto-Trigger Safety Net]
    N -->|No| P[Continue Monitoring]

    style A fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    style C fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px
    style H fill:#fff3e0,stroke:#e65100,stroke-width:2px
    style O fill:#fce4ec,stroke:#c2185b,stroke-width:2px

Primary Path: Human-Controlled Integration

Manual Trigger Process

# Human-centric workflow pattern
human_control:
  sync_completion: Clear instructions provided in sync workflow completion
  review_step: Explicit human review of upstream changes required
  manual_trigger: Cascade Integration workflow triggered manually
  progress_tracking: Real-time updates through issue comments

Enhanced User Guidance

# Sync workflow completion instructions
next_steps:
  review: "🔍 Review the sync PR for breaking changes or conflicts"
  merge: "✅ Merge the PR when satisfied with changes"
  trigger: "🚀 Manually trigger 'Cascade Integration' workflow"
  monitor: "📊 Monitor cascade progress in Actions tab"

Safety Net: Monitor-Based Detection

Automated Missed-Trigger Detection

# cascade-monitor.yml safety net
monitor_schedule:
  frequency: Every 6 hours (safety net detection)
  manual_trigger: workflow_dispatch for health checking
  detection_logic: Git-based branch comparison for pending changes
  recovery_action: Automatic cascade triggering with issue tracking

Intelligent Branch Comparison

# Reliable git-based detection
detection_algorithm:
  comparison: git rev-list --count fork_integration..fork_upstream
  threshold: UPSTREAM_COMMITS > 0 indicates pending changes
  issue_lookup: Find tracking issue using label-based search
  auto_trigger: Cascade workflow triggered with issue context
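
The detection rules above reduce to a small decision table. The following is a minimal sketch, assuming a hypothetical helper named `monitor_action` (not part of the actual workflow) that receives the commit count and the tracking issue number as arguments instead of calling `git` or `gh` itself:

```shell
# Hypothetical helper: decide what the monitor should do.
#   $1 = number of commits in fork_upstream missing from fork_integration
#   $2 = tracking issue number, or empty if no open upstream-sync issue was found
monitor_action() {
  upstream_commits=$1
  issue_number=$2

  if [ "$upstream_commits" -le 0 ]; then
    echo "continue-monitoring"       # branches are in sync, nothing pending
  elif [ -z "$issue_number" ]; then
    echo "skip-no-tracking-issue"    # pending changes, but no issue to attach to
  else
    echo "trigger-cascade:$issue_number"
  fi
}

monitor_action 0 ""     # prints continue-monitoring
monitor_action 3 ""     # prints skip-no-tracking-issue
monitor_action 3 42     # prints trigger-cascade:42
```

Keeping the decision separate from the `git` and `gh` calls makes it possible to exercise the logic without network access or a real repository.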

Comprehensive Issue Lifecycle Tracking

State Management Through Labels

# Label-based state progression
lifecycle_states:
  initial: upstream-sync (sync completed, awaiting review)
  active: cascade-active (integration in progress)
  blocked: cascade-blocked (conflicts requiring resolution)
  failed: cascade-failed + human-required (needs intervention)
  ready: production-ready (PR created, ready for deployment)
  complete: Issue closed (changes deployed to production)
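
The label progression can be read as a small state machine. The sketch below assumes a hypothetical `next_label` helper; the label names come from the table above, while the event names are invented for illustration. The transition out of cascade-blocked is left out because this document does not spell it out.

```shell
# Hypothetical helper: map (current label, event) to the next lifecycle label(s).
# Event names are illustrative only; they do not exist in the workflows.
next_label() {
  case "$1:$2" in
    upstream-sync:cascade-started)         echo "cascade-active" ;;
    cascade-active:conflicts-detected)     echo "cascade-blocked" ;;
    cascade-active:cascade-errored)        echo "cascade-failed human-required" ;;
    cascade-active:production-pr-created)  echo "production-ready" ;;
    cascade-failed:human-required-removed) echo "cascade-active" ;;
    production-ready:deployed)             echo "closed" ;;
    *) echo "invalid transition: $1 on $2" >&2; return 1 ;;
  esac
}

next_label upstream-sync cascade-started          # prints cascade-active
next_label cascade-failed human-required-removed  # prints cascade-active
```

Rejecting unknown combinations with a non-zero exit status lets a calling script stop before it applies a label change that the lifecycle does not allow.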

Real-Time Progress Updates

# Issue comment updates throughout cascade
progress_communication:
  start: "🚀 Cascade Integration Started - [timestamp]"
  conflicts: "🚨 Conflicts detected - manual resolution required"
  production: "🎯 Production PR Created - ready for final review"
  completion: "✅ Changes successfully deployed to production"
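
One way to keep these messages consistent across workflows is to generate them from a single function. This is a sketch only: `progress_comment` is a hypothetical helper, and the commented-out `gh issue comment` call assumes an authenticated GitHub CLI and an `ISSUE_NUMBER` variable.

```shell
# Hypothetical helper: build the comment body for a lifecycle stage.
#   $1 = stage (start | conflicts | production | completion)
#   $2 = timestamp, used only by the "start" message
progress_comment() {
  case "$1" in
    start)      echo "🚀 Cascade Integration Started - $2" ;;
    conflicts)  echo "🚨 Conflicts detected - manual resolution required" ;;
    production) echo "🎯 Production PR Created - ready for final review" ;;
    completion) echo "✅ Changes successfully deployed to production" ;;
    *) echo "unknown stage: $1" >&2; return 1 ;;
  esac
}

# Posting the comment (requires gh and a token, so it is left commented out here):
# gh issue comment "$ISSUE_NUMBER" \
#   --body "$(progress_comment start "$(date -u +%Y-%m-%dT%H:%M:%SZ)")"
```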

Implementation Strategy

Monitor Safety Net Architecture

Scheduled Detection System

# cascade-monitor.yml structure
name: Cascade Monitor
on:
  schedule:
    - cron: '0 */6 * * *'  # Safety net every 6 hours
  workflow_dispatch:        # Manual health checks

jobs:
  detect-missed-cascade:
    runs-on: ubuntu-latest
    steps:
      # Fetch full history so both branches are available to git rev-list
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Check for missed triggers
        env:
          # Assumption: a token allowed to list issues and dispatch workflows
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          # Check if fork_upstream has commits fork_integration lacks
          UPSTREAM_COMMITS=$(git rev-list --count origin/fork_integration..origin/fork_upstream)

          if [ "$UPSTREAM_COMMITS" -gt 0 ]; then
            # Find active tracking issue
            ISSUE_NUMBER=$(gh issue list \
              --label "upstream-sync" \
              --state open \
              --limit 1 \
              --json number \
              --jq '.[0].number // empty')

            if [ -n "$ISSUE_NUMBER" ]; then
              # Auto-trigger cascade as safety net
              gh workflow run "Cascade Integration" \
                --repo "${{ github.repository }}" \
                -f issue_number="$ISSUE_NUMBER"
            fi
          fi

Automated Failure Recovery System

Self-Healing Recovery Pattern

# Recovery detection and automatic retry
recovery_system:
  detection: Issues with cascade-failed but NOT human-required
  label_transition: cascade-failed → cascade-active
  auto_retry: Workflow triggered automatically
  human_signal: Removing human-required label indicates readiness
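
The detection rule is a simple predicate over an issue's labels: cascade-failed must be present and human-required must be absent. A minimal sketch, assuming a hypothetical `is_recovery_ready` helper that takes the label names as one space-separated string:

```shell
# Hypothetical helper: succeed (exit 0) when an issue is ready for automatic retry.
#   $1 = space-separated list of the issue's label names
is_recovery_ready() {
  has_failed=no
  has_human=no
  for label in $1; do
    case "$label" in
      cascade-failed) has_failed=yes ;;
      human-required) has_human=yes ;;
    esac
  done
  # Ready only if cascade-failed is present AND human-required is absent
  [ "$has_failed" = yes ] && [ "$has_human" = no ]
}

is_recovery_ready "cascade-failed human-required" || echo "waiting for human"
is_recovery_ready "cascade-failed" && echo "retry"
```

The first call prints "waiting for human" and the second prints "retry", mirroring the hand-off in which removing human-required is the only signal the automation waits for.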

Recovery Workflow Logic

# Failure recovery detection
detect-recovery-ready:
  steps:
    - name: Check for recovery-ready issues
      run: |
        # Find open cascade-failed issues that no longer carry human-required.
        # --jq only works together with --json; labels are objects, so match on .name.
        RECOVERY_ISSUES=$(gh issue list \
          --label "cascade-failed" \
          --state open \
          --json number,labels \
          --jq '.[] | select(any(.labels[]; .name == "human-required") | not) | .number')

        # For each recovery-ready issue
        echo "$RECOVERY_ISSUES" | while read -r ISSUE_NUMBER; do
          # Skip the empty line produced when nothing matched
          [ -n "$ISSUE_NUMBER" ] || continue

          # Update state: cascade-failed → cascade-active
          gh issue edit "$ISSUE_NUMBER" \
            --remove-label "cascade-failed" \
            --add-label "cascade-active"

          # Trigger automatic retry
          gh workflow run "Cascade Integration" \
            --repo "${{ github.repository }}" \
            -f issue_number="$ISSUE_NUMBER"
        done

Human Recovery Workflow

Failure-to-Recovery Process

# Human intervention workflow
failure_recovery:
  failure_detection: Cascade fails, issue gets cascade-failed + human-required
  failure_issue: Technical details in separate high-priority issue
  human_investigation: Developer reviews failure and implements fixes
  recovery_signal: Human removes human-required label from tracking issue
  auto_retry: Monitor detects label removal and retries cascade
  outcome: Either success or new failure issue creation

Benefits and Rationale

Strategic Advantages

Enhanced Team Control and Predictability

  • Teams have explicit control over integration timing and batching
  • Predictable, documented process eliminates surprising automatic triggers
  • Clear instructions and guidance reduce cognitive load during operations
  • Flexible timing allows batching changes or avoiding maintenance windows

Improved Reliability and Visibility

  • No dependencies on complex GitHub event triggering edge cases
  • Git-based detection provides reliable branch state comparison
  • Complete audit trail through issue lifecycle tracking
  • Clear error recovery path with obvious next steps

Self-Healing System Architecture

  • Automatic detection and recovery of missed manual triggers
  • Label-based state management enables sophisticated failure recovery
  • Human-automation handoff points clearly defined and tracked
  • Robust error handling with multiple failure attempts tracked separately

Technical Architecture Benefits

Simplified Event Handling

  • Eliminates complex conditional logic in workflow triggering
  • Removes dependencies on workflow file availability in target branches
  • Git-based detection more reliable than GitHub event filtering
  • Clear separation between detection, triggering, and execution

Comprehensive State Management

  • Issue-based coordination provides persistent state tracking
  • Label-based state machine enables sophisticated workflow control
  • Comment-based progress updates provide real-time visibility
  • Complete failure/recovery history maintained in tracking issues

Alternative Approaches Considered

Direct Push Triggers

Approach: Simple push-based triggering on fork_upstream branch

on:
  push:
    branches: [fork_upstream]

  • Pros: Simple implementation, immediate triggering response
  • Cons: Fires on all pushes, no filtering by intent, no human control
  • Decision: Rejected due to unwanted triggers and lack of control

Combined PR and Push Triggers

Approach: Complex multi-trigger system with conditional logic

on:
  push:
    branches: [fork_upstream, fork_integration]
  pull_request:
    types: [closed]
    branches: [fork_upstream, fork_integration]

  • Pros: Handles various trigger scenarios comprehensively
  • Cons: Complex conditional logic, difficult debugging, poor error handling
  • Decision: Rejected due to complexity and reliability concerns

External Webhook System

Approach: External service for centralized trigger management

  • Pros: Maximum flexibility, sophisticated control capabilities
  • Cons: Additional infrastructure, security concerns, maintenance overhead
  • Decision: Rejected due to complexity without proportional benefits

High-Frequency Scheduled Polling

Approach: Frequent scheduled checks for changes

on:
  schedule:
    - cron: '*/5 * * * *'  # Every 5 minutes

  • Pros: Guaranteed to catch changes eventually
  • Cons: Up to 5-minute delay, inefficient resource usage, doesn't scale
  • Decision: Rejected as the primary approach (scheduled polling is kept only as the 6-hour safety net in the monitor)

Consequences and Trade-offs

Positive Outcomes

Enhanced Human Experience

  • Explicit control over integration timing builds team confidence
  • Clear instructions eliminate confusion about next steps
  • Complete visibility into cascade progress through issue tracking
  • Predictable process enables reliable team adoption

System Reliability Improvements

  • No dependency on complex GitHub event triggering mechanisms
  • Git-based detection provides robust change identification
  • Automated safety net prevents missed triggers from blocking progress
  • Self-healing recovery system reduces manual intervention requirements

Operational Excellence

  • Complete audit trail for all cascade decisions and outcomes
  • Clear error recovery path with obvious resolution steps
  • Sophisticated failure/recovery tracking through issue labels
  • Flexible timing control enables operational best practices

Trade-offs and Limitations

Manual Process Requirements

  • Humans must remember to trigger cascades after sync completion
  • Potential delay of up to 6 hours if the manual trigger is forgotten
  • Team learning curve for understanding manual trigger process

System Complexity

  • Issue lifecycle tracking adds workflow coordination complexity
  • Cross-workflow dependencies require understanding of interaction patterns
  • Safety net functionality depends on the monitor workflow running as scheduled

Operational Considerations

  • Additional workflow file increases maintenance surface area
  • Testing requires both trigger detection and cascade execution validation
  • Debugging may span multiple workflows requiring end-to-end understanding

Success Metrics

Quantitative Indicators

  • Manual Trigger Adoption: >90% of sync merges followed by manual cascade triggers within 2 hours
  • Safety Net Effectiveness: 100% of missed manual triggers detected by monitor within 6 hours
  • Issue Lifecycle Completeness: 95%+ of cascades with complete issue tracking
  • Conflict Resolution SLA: 48 hours with automatic escalation for conflicts
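
The adoption target is a strict ">90%" threshold, which is easy to get wrong at the boundary. A small sketch of the check, assuming a hypothetical `adoption_target_met` helper and counts gathered by some other means (this document does not define how they are collected):

```shell
# Hypothetical helper: check the ">90% manual trigger adoption" target.
#   $1 = sync merges in the measurement period
#   $2 = merges followed by a manual cascade trigger within 2 hours
adoption_target_met() {
  [ "$1" -gt 0 ] || return 1    # no merges in the period: nothing to measure
  # Integer form of triggered/total > 90/100, avoiding floating point
  [ $(( $2 * 100 )) -gt $(( $1 * 90 )) ]
}

adoption_target_met 20 19 && echo "target met"      # 95%: prints target met
adoption_target_met 20 18 || echo "target missed"   # exactly 90% is not above 90%
```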

Qualitative Indicators

  • Team reports high confidence in cascade triggering process
  • Clear understanding of manual trigger requirements across all team members
  • Effective error recovery with minimal support intervention required
  • Audit trail is complete enough to satisfy operational compliance requirements

Integration Points

Workflow Integration

Sync Workflow Coordination

  • Sync workflow creates PRs with upstream-sync label for identification
  • Tracking issues created with explicit manual trigger instructions
  • Human review, merge, and manual cascade trigger workflow

Cascade Workflow Enhancement

  • Cascade runs on workflow_dispatch (manual or monitor-triggered)
  • Issue label updates and progress comments throughout process
  • Conflict handling, integration, and production PR creation coordination

Label Management Integration (per ADR-008)

  • Predefined labels: upstream-sync, cascade-trigger-failed, human-required
  • Leverages the existing label-based notification system
  • Maintains consistency with other workflow patterns

Related Decisions

  • ADR-001: Three-branch strategy defines cascade target branches
  • ADR-005: Conflict handling within cascade workflows
  • ADR-008: Centralized label management for state tracking
  • ADR-009: Asymmetric cascade review requirements
  • ADR-020: Human-required labeling coordinates activities

This cascade monitor pattern provides reliable, human-controlled integration with automated safety nets, ensuring predictable operations while maintaining system reliability through comprehensive state management and self-healing recovery capabilities.