Production Jekyll deployments require sophisticated error handling and monitoring to ensure reliability and quick issue resolution. By combining Ruby's exception handling capabilities with Cloudflare's monitoring tools and GitHub Actions' workflow tracking, you can build a robust observability system. This guide explores advanced error handling patterns, distributed tracing, alerting systems, and performance monitoring specifically tailored for Jekyll deployments across the GitHub-Cloudflare pipeline.

In This Guide

Error Handling Architecture and Patterns
Advanced Ruby Exception Handling and Recovery
Cloudflare Analytics and Error Tracking
GitHub Actions Workflow Monitoring and Alerting
Distributed Tracing Across Deployment Pipeline
Intelligent Alerting and Incident Response

Error Handling Architecture and Patterns

A comprehensive error handling architecture spans the entire deployment pipeline from local development to production edge delivery. The system must capture, categorize, and handle errors at each stage while maintaining context for debugging.

The architecture implements a layered approach with error handling at the build layer (Ruby/Jekyll), deployment layer (GitHub Actions), and runtime layer (Cloudflare Workers/Pages). Each layer captures errors with appropriate context and forwards them to a centralized error aggregation system. The system supports error classification, automatic recovery attempts, and context preservation for post-mortem analysis.


# Error Handling Architecture:
# 1. Build Layer Errors:
#    - Jekyll build failures (template errors, data validation)
#    - Ruby gem dependency issues
#    - Asset compilation failures
#    - Content validation errors
#
# 2. Deployment Layer Errors:
#    - GitHub Actions workflow failures
#    - Cloudflare Pages deployment failures
#    - DNS configuration errors
#    - Environment variable issues
#
# 3. Runtime Layer Errors:
#    - 4xx/5xx errors from Cloudflare edge
#    - Worker runtime exceptions
#    - API integration failures
#    - Cache invalidation errors
#
# 4. Monitoring Layer:
#    - Error aggregation and deduplication
#    - Alert routing and escalation
#    - Performance anomaly detection
#    - Automated recovery procedures

# Error Classification:
# - Fatal: Requires immediate human intervention
# - Recoverable: Automatic recovery can be attempted
# - Transient: Temporary issues that may resolve themselves
# - Warning: Non-critical issues for investigation
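
The classification above drives everything downstream: retry behavior, notification channel, and escalation. As a minimal sketch (the policy names and the policy_for helper are illustrative, not part of the library code shown later in this guide), the mapping can be a frozen hash keyed by classification:

# config/error_policies.rb (illustrative sketch)
ERROR_POLICIES = {
  fatal:       { retry: false, notify: :page_oncall,   escalate: true  },
  recoverable: { retry: true,  notify: :channel_alert, escalate: false },
  transient:   { retry: true,  notify: :none,          escalate: false },
  warning:     { retry: false, notify: :daily_digest,  escalate: false }
}.freeze

def policy_for(classification)
  # Unknown classifications are treated as fatal so nothing slips through silently
  ERROR_POLICIES.fetch(classification, ERROR_POLICIES[:fatal])
end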

Advanced Ruby Exception Handling and Recovery

Ruby provides sophisticated exception handling capabilities that can be extended for Jekyll deployments with automatic recovery, error context preservation, and intelligent retry logic.


# lib/deployment_error_handler.rb
require 'time'

module DeploymentErrorHandler
  class Error < StandardError
    attr_reader :context, :severity
    # recovery_attempts is incremented by the Handler, so it needs a writer too
    attr_accessor :recovery_attempts
    
    def initialize(message, context = {}, severity = :error)
      super(message)
      @context = context
      @severity = severity
      @recovery_attempts = 0
    end
    
    def to_h
      {
        message: message,
        backtrace: backtrace,
        context: @context,
        severity: @severity.to_s,
        timestamp: Time.now.utc.iso8601,
        recovery_attempts: @recovery_attempts
      }
    end
  end
  
  class BuildError < Error
    def initialize(message, file: nil, line: nil, template: nil)
      super(message, {
        file: file,
        line: line,
        template: template,
        stage: 'build'
      }, :critical)
    end
  end
  
  class DeploymentError < Error
    def initialize(message, deployment_id: nil, stage: nil)
      super(message, {
        deployment_id: deployment_id,
        stage: stage || 'deployment',
        environment: ENV['JEKYLL_ENV'] || 'production'
      }, :critical)
    end
  end
  
  # Error handler with recovery logic
  class Handler
    def initialize(config = {})
      @config = config
      @error_store = ErrorStore.new(config[:error_store])
      @recovery_strategies = load_recovery_strategies
      @notifiers = load_notifiers
    end
    
    def handle(error, context = {})
      # Add additional context
      error.context.merge!(context)
      
      # Store error for analysis
      @error_store.record(error)
      
      # Attempt recovery for recoverable errors
      if recoverable?(error)
        attempt_recovery(error)
      end
      
      # Notify based on severity
      notify(error) if should_notify?(error)
      
      # Re-raise fatal errors
      raise error if fatal?(error)
    end
    
    def attempt_recovery(error)
      error.recovery_attempts += 1
      
      @recovery_strategies.each do |strategy|
        if strategy.applies_to?(error)
          begin
            strategy.recover(error)
            log_recovery_success(error, strategy)
            return true
          rescue => recovery_error
            log_recovery_failure(error, strategy, recovery_error)
          end
        end
      end
      
      false
    end
    
    def with_error_handling(context = {}, &block)
      begin
        block.call
      rescue Error => e
        handle(e, context)
        raise e
      rescue => e
        # Convert generic errors to typed errors
        typed_error = classify_error(e, context)
        handle(typed_error, context)
        raise typed_error
      end
    end
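    
    private
    
    # Assumed implementations for the predicate helpers referenced above
    # (recoverable?, fatal?, should_notify?, classify_error). The other private
    # helpers (load_recovery_strategies, load_notifiers, notify, logging) are
    # omitted here; wire them to your own strategies and notifiers.
    def recoverable?(error)
      %i[warning error].include?(error.severity)
    end
    
    def fatal?(error)
      error.severity == :critical
    end
    
    def should_notify?(error)
      error.severity != :warning || error.recovery_attempts.zero?
    end
    
    def classify_error(error, context)
      case context[:stage]
      when 'jekyll_build', 'site_validation'
        BuildError.new(error.message, file: context[:file], line: context[:line])
      else
        DeploymentError.new(error.message, stage: context[:stage])
      end
    end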
  end
  
  # Recovery strategies for common errors
  class RecoveryStrategy
    def applies_to?(error)
      false
    end
    
    def recover(error)
      raise NotImplementedError
    end
  end
  
  class GemInstallationRecovery < RecoveryStrategy
    def applies_to?(error)
      error.is_a?(BuildError) &&
        (error.message.include?('Gem::LoadError') ||
         error.message.include?('bundle install'))
    end
    
    def recover(error)
      # Attempt to clear bundle cache and retry
      system('bundle clean --force')
      system('bundle install')
      
      # Verify recovery
      raise 'Recovery failed' unless system('bundle check')
    end
  end
  
  class CloudflareDeploymentRecovery < RecoveryStrategy
    def applies_to?(error)
      error.is_a?(DeploymentError) &&
      error.message.include?('Cloudflare')
    end
    
    def recover(error)
      # Extract deployment ID from error context
      deployment_id = error.context[:deployment_id]
      
      if deployment_id
        # Attempt to retry deployment
        client = Cloudflare::Client.new(ENV['CLOUDFLARE_API_TOKEN'])
        client.retry_deployment(deployment_id)
      else
        # Trigger new deployment
        trigger_new_deployment
      end
    end
  end
  
  # Error store with aggregation
  class ErrorStore
    def initialize(store_config = nil)
      store_config ||= {}
      @store = case store_config[:type]
               when :redis then Redis.new(url: store_config[:url])
               when :file then FileStore.new(store_config[:path])
               when :cloudflare then CloudflareStore.new(store_config[:token])
               else MemoryStore.new
               end
      @aggregator = ErrorAggregator.new
    end
    
    def record(error)
      error_data = error.to_h
      
      # Aggregate similar errors
      fingerprint = @aggregator.fingerprint(error)
      
      if @store.exists?(fingerprint)
        # Update existing error count
        existing = @store.get(fingerprint)
        existing[:count] += 1
        existing[:last_occurrence] = Time.now.utc.iso8601
        @store.set(fingerprint, existing)
      else
        # Store new error
        error_data[:fingerprint] = fingerprint
        error_data[:count] = 1
        error_data[:first_occurrence] = Time.now.utc.iso8601
        @store.set(fingerprint, error_data)
      end
    end
  end
  
  # Jekyll plugin for error handling
  class JekyllErrorHandler
    def initialize(site)
      @site = site
      @handler = DeploymentErrorHandler::Handler.new(
        site.config['error_handling'] || {}
      )
    end
    
    def handle_build_errors
      @handler.with_error_handling(stage: 'jekyll_build') do
        yield
      end
    end
    
    def validate_site
      @handler.with_error_handling(stage: 'site_validation') do
        validate_configuration
        validate_content
        validate_urls
      end
    end
  end
end

# Integrate with Jekyll build process
Jekyll::Hooks.register :site, :pre_render do |site|
  error_handler = DeploymentErrorHandler::JekyllErrorHandler.new(site)
  error_handler.validate_site
end

Jekyll::Hooks.register :site, :post_write do |site|
  error_handler = DeploymentErrorHandler::JekyllErrorHandler.new(site)
  error_handler.handle_build_errors do
    # Post-build validation
    validate_build_output(site)
  end
end
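
The ErrorStore above delegates to a backing store and an ErrorAggregator that are referenced but not defined in this guide. A minimal in-memory sketch of both, useful for local development, could look like the following; the interface (exists?, get, set, fingerprint) is assumed from the calls in ErrorStore#record.

# lib/deployment_error_handler/memory_store.rb (illustrative sketch)
require 'digest'

module DeploymentErrorHandler
  class MemoryStore
    def initialize
      @data = {}
    end
    
    def exists?(key)
      @data.key?(key)
    end
    
    def get(key)
      @data[key]
    end
    
    def set(key, value)
      @data[key] = value
    end
  end
  
  class ErrorAggregator
    # Group errors by class, message, and stage so repeats are counted
    # instead of stored as new entries.
    def fingerprint(error)
      Digest::SHA256.hexdigest([
        error.class.name,
        error.message,
        error.context[:stage]
      ].compact.join('|'))
    end
  end
end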

Cloudflare Analytics and Error Tracking

Cloudflare provides comprehensive analytics and error tracking through its dashboard and API. Advanced monitoring integrates these capabilities with custom error tracking for Jekyll deployments.


# lib/cloudflare_monitoring.rb
require 'time'

module CloudflareMonitoring
  class AnalyticsCollector
    def initialize(api_token, zone_id)
      @client = Cloudflare::Client.new(api_token)
      @zone_id = zone_id
      @cache = {}
      @last_fetch = nil
    end
    
    def fetch_errors(time_range = 'last_24_hours')
      # Fetch error analytics from Cloudflare
      data = @client.analytics(
        @zone_id,
        metrics: ['requests', 'status_4xx', 'status_5xx', 'status_403', 'status_404'],
        dimensions: ['clientCountry', 'path', 'status'],
        time_range: time_range
      )
      
      process_error_data(data)
    end
    
    def fetch_performance(time_range = 'last_hour')
      # Fetch performance metrics
      data = @client.analytics(
        @zone_id,
        metrics: ['pageViews', 'bandwidth', 'visits', 'requests'],
        dimensions: ['path', 'referer'],
        time_range: time_range,
        granularity: 'hour'
      )
      
      process_performance_data(data)
    end
    
    def detect_anomalies
      # Detect anomalies in traffic patterns
      current = fetch_performance('last_hour')
      historical = fetch_historical_baseline
      
      anomalies = []
      
      current.each do |metric, value|
        baseline = historical[metric]
        
        if baseline && anomaly_detected?(value, baseline)
          anomalies << {
            metric: metric,
            current: value,
            baseline: baseline,
            deviation: calculate_deviation(value, baseline),
            timestamp: Time.now.utc.iso8601
          }
        end
      end
      
      anomalies
    end
    
    private
    
    def process_error_data(data)
      errors = []
      
      data['results'].each do |result|
        if result['status'].to_i >= 400
          errors << {
            status: result['status'],
            path: result['path'],
            count: result['requests'],
            country: result['clientCountry'],
            timestamp: Time.now.utc.iso8601
          }
        end
      end
      
      errors.sort_by { |e| -e[:count] }
    end
    
    def fetch_historical_baseline
      # Fetch historical data for comparison
      @cache[:historical_baseline] ||= begin
        data = @client.analytics(
          @zone_id,
          metrics: ['requests', 'bandwidth', 'visits'],
          time_range: 'last_30_days',
          granularity: 'day'
        )
        
        calculate_baseline(data)
      end
    end
    
    def calculate_baseline(data)
      # Calculate average and standard deviation
      metrics = Hash.new { |h, k| h[k] = [] }
      
      data['results'].each do |result|
        metrics['requests'] << result['requests']
        metrics['bandwidth'] << result['bandwidth']
        metrics['visits'] << result['visits']
      end
      
      baseline = {}
      metrics.each do |metric, values|
        baseline[metric] = {
          average: values.sum / values.size.to_f,
          std_dev: standard_deviation(values),
          min: values.min,
          max: values.max
        }
      end
      
      baseline
    end
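    
    # Assumed implementations for the helpers referenced above
    # (process_performance_data, anomaly_detected?, calculate_deviation,
    # standard_deviation); tune the threshold to your own traffic profile.
    def process_performance_data(data)
      totals = Hash.new(0)
      data['results'].each do |result|
        %w[pageViews bandwidth visits requests].each do |metric|
          totals[metric] += result[metric].to_i if result[metric]
        end
      end
      totals
    end
    
    def anomaly_detected?(value, baseline, threshold = 3.0)
      return false if baseline[:std_dev].zero?
      ((value - baseline[:average]).abs / baseline[:std_dev]) > threshold
    end
    
    def calculate_deviation(value, baseline)
      return 0.0 if baseline[:average].zero?
      (((value - baseline[:average]) / baseline[:average]) * 100).round(2)
    end
    
    def standard_deviation(values)
      mean = values.sum / values.size.to_f
      variance = values.map { |v| (v - mean)**2 }.sum / values.size
      Math.sqrt(variance)
    end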
  end
  
  class ErrorTracker
    def initialize(api_token, account_id)
      @client = Cloudflare::Client.new(api_token)
      @account_id = account_id
    end
    
    def track_error(error_data, severity = :error)
      # Send error to Cloudflare Logs
      log_entry = {
        message: error_data[:message],
        severity: severity,
        timestamp: error_data[:timestamp] || Time.now.utc.iso8601,
        context: error_data[:context] || {},
        environment: ENV['JEKYLL_ENV'] || 'production'
      }
      
      @client.send_logs(@account_id, 'jekyll-errors', [log_entry])
    end
    
    def get_error_summary(time_range = 'last_24_hours')
      # Fetch error summary from logs
      query = <<~QUERY
        | filter severity in ("error", "critical")
        | summarize count() by bin(timestamp, 1h), severity
      QUERY
      
      @client.query_logs(@account_id, query, time_range)
    end
    
    def create_alert_policy(conditions, notifications = [])
      # Create alert policy for specific error conditions
      policy = {
        name: "Jekyll Deployment Alerts",
        enabled: true,
        alert_type: "stream",
        conditions: conditions,
        filters: {
          source: "worker_logs",
          service: "jekyll-deployment"
        },
        notifications: notifications
      }
      
      @client.create_alert_policy(@account_id, policy)
    end
  end
end

# Worker for error tracking
// workers/error-tracker.js
  export default {
    async fetch(request, env, ctx) {
      const url = new URL(request.url)
      
      if (url.pathname === '/api/errors' && request.method === 'POST') {
        return handleErrorReport(request, env, ctx)
      }
      
      if (url.pathname === '/api/errors/summary') {
        return getErrorSummary(env)
      }
      
      return new Response('Not found', { status: 404 })
    }
  }
  
  async function handleErrorReport(request, env, ctx) {
    const errorData = await request.json()
    
    // Validate error data
    if (!errorData.message || !errorData.timestamp) {
      return new Response('Invalid error data', { status: 400 })
    }
    
    // Store error in KV
    const errorId = generateErrorId()
    await env.ERRORS_KV.put(
      `error:${errorId}`,
      JSON.stringify({
        ...errorData,
        id: errorId,
        received: new Date().toISOString()
      }),
      { expirationTtl: 604800 } // 7 days
    )
    
    // Update error aggregation
    await updateErrorAggregation(errorData, env)
    
    // Trigger alerts if needed
    if (errorData.severity === 'critical') {
      await triggerCriticalAlert(errorData, env, ctx)
    }
    
    return new Response(JSON.stringify({ id: errorId }), {
      headers: { 'Content-Type': 'application/json' }
    })
  }
  
  async function updateErrorAggregation(errorData, env) {
    const hour = Math.floor(Date.now() / 3600000) * 3600000
    const key = `aggregate:${hour}:${errorData.type || 'unknown'}`
    
    const current = await env.ERRORS_KV.get(key, { type: 'json' }) || {
      count: 0,
      first_seen: errorData.timestamp,
      last_seen: errorData.timestamp
    }
    
    current.count += 1
    current.last_seen = errorData.timestamp
    
    await env.ERRORS_KV.put(key, JSON.stringify(current), {
      expirationTtl: 172800 // 48 hours
    })
  }
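
On the Ruby side, build and deployment scripts can report into this Worker over plain HTTP. A minimal reporter sketch follows; the ERROR_TRACKER_URL and ERROR_TRACKER_TOKEN variables are assumptions, so point them at wherever the Worker is deployed.

# lib/error_reporter.rb (illustrative sketch)
require 'net/http'
require 'uri'
require 'json'
require 'time'

class ErrorReporter
  def initialize(endpoint = ENV['ERROR_TRACKER_URL'], token = ENV['ERROR_TRACKER_TOKEN'])
    @uri = URI.parse("#{endpoint}/api/errors")
    @token = token
  end
  
  # Accepts the hash produced by DeploymentErrorHandler::Error#to_h
  def report(error_hash, severity: 'error')
    request = Net::HTTP::Post.new(@uri.path)
    request['Content-Type'] = 'application/json'
    request['Authorization'] = "Bearer #{@token}" if @token
    request.body = error_hash.merge(
      severity: severity,
      timestamp: error_hash[:timestamp] || Time.now.utc.iso8601
    ).to_json
    
    Net::HTTP.start(@uri.host, @uri.port, use_ssl: @uri.scheme == 'https') do |http|
      http.request(request)
    end
  end
end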

GitHub Actions Workflow Monitoring and Alerting

GitHub Actions provides extensive workflow monitoring capabilities that can be enhanced with custom Ruby scripts for deployment tracking and alerting.


# .github/workflows/monitoring.yml
name: Deployment Monitoring

on:
  workflow_run:
    workflows: ["Deploy to Production"]
    types:
      - completed
      - requested
  schedule:
    - cron: '*/5 * * * *'  # Check every 5 minutes

jobs:
  monitor-deployment:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Check workflow status
        id: check_status
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          ruby .github/scripts/check_deployment_status.rb
          
      - name: Send alerts if needed
        if: steps.check_status.outputs.status != 'success'
        run: |
          ruby .github/scripts/send_alert.rb \
            --status "${{ steps.check_status.outputs.status }}" \
            --workflow "${{ github.event.workflow_run.name }}" \
            --run-id "${{ github.event.workflow_run.id }}"
            
      - name: Update deployment dashboard
        run: |
          ruby .github/scripts/update_dashboard.rb \
            --run-id "${{ github.event.workflow_run.id }}" \
            --status "${{ steps.check_status.outputs.status }}" \
            --duration "${{ steps.check_status.outputs.duration }}"

  health-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run comprehensive health check
        id: health_check
        run: |
          ruby .github/scripts/health_check.rb
          
      - name: Report health status
        if: always()
        run: |
          ruby .github/scripts/report_health.rb \
            --exit-code "${{ steps.health_check.outcome }}"

# .github/scripts/check_deployment_status.rb
#!/usr/bin/env ruby
require 'octokit'
require 'json'
require 'time'

class DeploymentMonitor
  def initialize(token, repository)
    @client = Octokit::Client.new(access_token: token)
    @repository = repository
  end
  
  def check_workflow_run(run_id)
    run = @client.workflow_run(@repository, run_id)
    
    {
      status: run.status,
      conclusion: run.conclusion,
      duration: calculate_duration(run),
      artifacts_url: run.artifacts_url,
      jobs: fetch_jobs(run_id),
      created_at: run.created_at,
      updated_at: run.updated_at
    }
  end
  
  def check_recent_deployments(limit = 5)
    runs = @client.workflow_runs(
      @repository,
      workflow_file_name: 'deploy.yml',
      per_page: limit
    )
    
    runs.workflow_runs.map do |run|
      {
        id: run.id,
        status: run.status,
        conclusion: run.conclusion,
        created_at: run.created_at,
        head_branch: run.head_branch,
        head_sha: run.head_sha
      }
    end
  end
  
  def deployment_health_score
    recent = check_recent_deployments(10)
    
    successful = recent.count { |r| r[:conclusion] == 'success' }
    total = recent.size
    
    return 100 if total == 0
    
    (successful.to_f / total * 100).round(2)
  end
  
  private
  
  def calculate_duration(run)
    if run.status == 'completed' && run.conclusion == 'success'
      # Octokit already returns created_at/updated_at as Time objects
      (run.updated_at - run.created_at).round(2)
    else
      nil
    end
  end
  
  def fetch_jobs(run_id)
    jobs = @client.workflow_run_jobs(@repository, run_id)
    
    jobs.jobs.map do |job|
      {
        name: job.name,
        status: job.status,
        conclusion: job.conclusion,
        started_at: job.started_at,
        completed_at: job.completed_at,
        steps: job.steps.map { |s| { name: s.name, conclusion: s.conclusion } }
      }
    end
  end
end

if __FILE__ == $0
  token = ENV['GITHUB_TOKEN']
  repository = ENV['GITHUB_REPOSITORY']
  run_id = ARGV[0] || ENV['GITHUB_RUN_ID']
  
  monitor = DeploymentMonitor.new(token, repository)
  
  if run_id
    result = monitor.check_workflow_run(run_id)
    
    # Write step outputs for GitHub Actions (fall back to stdout when run locally)
    out = ENV['GITHUB_OUTPUT'] ? File.open(ENV['GITHUB_OUTPUT'], 'a') : $stdout
    out.puts "status=#{result[:conclusion] || result[:status]}"
    out.puts "duration=#{result[:duration] || 0}"
    out.close if out.is_a?(File)
    
    # JSON output
    File.write('deployment_status.json', JSON.pretty_generate(result))
  else
    # Check deployment health
    score = monitor.deployment_health_score
    puts "health_score=#{score}"
    
    if score < 80
      puts "Health check failed: #{score}% success rate"
      exit 1
    end
  end
end

# .github/scripts/send_alert.rb
#!/usr/bin/env ruby
require 'net/http'
require 'uri'
require 'json'
require 'time'

class AlertSender
  def initialize(config)
    @config = config
    @notifiers = build_notifiers
  end
  
  def send_alert(alert_data)
    alert_data[:timestamp] = Time.now.utc.iso8601
    
    @notifiers.each do |notifier|
      begin
        notifier.send(alert_data)
      rescue => e
        log("Failed to send alert via #{notifier.class}: #{e.message}")
      end
    end
    
    # Store alert for audit
    store_alert(alert_data)
  end
  
  private
  
  def build_notifiers
    notifiers = []
    
    if @config[:slack_webhook]
      notifiers << SlackNotifier.new(@config[:slack_webhook])
    end
    
    if @config[:discord_webhook]
      notifiers << DiscordNotifier.new(@config[:discord_webhook])
    end
    
    if @config[:pagerduty_key]
      notifiers << PagerDutyNotifier.new(@config[:pagerduty_key])
    end
    
    notifiers
  end
  
  def store_alert(alert_data)
    # Store in Cloudflare KV via Worker
    uri = URI.parse('https://alerts.yourdomain.com/api/alerts')
    
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = true
    
    request = Net::HTTP::Post.new(uri.path)
    request['Authorization'] = "Bearer #{@config[:alert_token]}"
    request['Content-Type'] = 'application/json'
    request.body = alert_data.to_json
    
    http.request(request)
  end
end

class SlackNotifier
  def initialize(webhook_url)
    @webhook_url = webhook_url
  end
  
  def send(alert_data)
    payload = {
      text: format_message(alert_data),
      attachments: [
        {
          color: alert_color(alert_data[:severity]),
          fields: format_fields(alert_data),
          ts: Time.now.to_i
        }
      ]
    }
    
    uri = URI.parse(@webhook_url)
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = true
    
    request = Net::HTTP::Post.new(uri.path)
    request['Content-Type'] = 'application/json'
    request.body = payload.to_json
    
    http.request(request)
  end
  
  private
  
  def format_message(alert_data)
    emoji = case alert_data[:severity]
            when 'critical' then '🚨'
            when 'error' then '❌'
            when 'warning' then '⚠️'
            else 'ℹ️'
            end
    
    "#{emoji} *#{alert_data[:title]}*\n#{alert_data[:message]}"
  end
end
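
The DiscordNotifier and PagerDutyNotifier referenced in build_notifiers follow the same webhook pattern and are not shown in full here. A minimal DiscordNotifier sketch, for completeness (Discord webhooks accept a JSON body with a content field):

class DiscordNotifier
  def initialize(webhook_url)
    @webhook_url = webhook_url
  end
  
  def send(alert_data)
    uri = URI.parse(@webhook_url)
    payload = {
      # Discord renders basic Markdown in the content field
      content: "**#{alert_data[:title] || 'Deployment alert'}**\n#{alert_data[:message]}"
    }
    
    request = Net::HTTP::Post.new(uri.path)
    request['Content-Type'] = 'application/json'
    request.body = payload.to_json
    
    Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
      http.request(request)
    end
  end
end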

Distributed Tracing Across Deployment Pipeline

Distributed tracing provides end-to-end visibility across the deployment pipeline, connecting errors and performance issues across different systems and services.


# lib/distributed_tracing.rb
require 'securerandom'
require 'time'

module DistributedTracing
  class Trace
    attr_reader :trace_id, :spans, :metadata
    
    def initialize(trace_id = nil, metadata = {})
      @trace_id = trace_id || generate_trace_id
      @spans = []
      @metadata = metadata
      @start_time = Time.now.utc
    end
    
    def start_span(name, attributes = {})
      span = Span.new(
        name: name,
        trace_id: @trace_id,
        span_id: generate_span_id,
        parent_span_id: current_span_id,
        attributes: attributes,
        start_time: Time.now.utc
      )
      
      @spans << span
      span
    end
    
    def finish_span(span, status = :ok, error = nil)
      span.finish(status, error)
    end
    
    def export
      {
        trace_id: @trace_id,
        metadata: @metadata,
        spans: @spans.map(&:to_h),
        duration: (Time.now.utc - @start_time).round(3),
        start_time: @start_time.iso8601,
        end_time: Time.now.utc.iso8601
      }
    end
    
    def send_to_collector(collector_url)
      exporter = Exporter.new(collector_url)
      exporter.export(self)
    end
    
    private
    
    def generate_trace_id
      SecureRandom.hex(16)
    end
    
    def generate_span_id
      SecureRandom.hex(8)
    end
    
    def current_span_id
      @spans.last&.span_id
    end
  end
  
  class Span
    attr_reader :name, :trace_id, :span_id, :parent_span_id, :attributes
    attr_reader :start_time, :end_time, :status, :error
    
    def initialize(name:, trace_id:, span_id:, parent_span_id:, attributes:, start_time:)
      @name = name
      @trace_id = trace_id
      @span_id = span_id
      @parent_span_id = parent_span_id
      @attributes = attributes
      @start_time = start_time
      @events = []
    end
    
    def add_event(name, attributes = {})
      @events << {
        name: name,
        attributes: attributes,
        timestamp: Time.now.utc.iso8601
      }
    end
    
    def finish(status = :ok, error = nil)
      @end_time = Time.now.utc
      @status = status
      @error = error
      @duration = (@end_time - @start_time).round(6)
    end
    
    def to_h
      {
        name: @name,
        trace_id: @trace_id,
        span_id: @span_id,
        parent_span_id: @parent_span_id,
        attributes: @attributes,
        start_time: @start_time.iso8601,
        end_time: @end_time&.iso8601,
        duration: @duration,
        status: @status,
        error: @error&.message,
        events: @events
      }
    end
  end
  
  # Jekyll build tracing
  class JekyllTracer
    attr_reader :trace
    
    def initialize(trace, site = nil)
      @trace = trace
      @site = site
      @current_span = nil
    end
    
    def trace_build(&block)
      @current_span = @trace.start_span('jekyll_build', {
        environment: ENV['JEKYLL_ENV'],
        site_source: @site&.source,
        site_dest: @site&.dest
      })
      
      begin
        result = block.call
        @trace.finish_span(@current_span, :ok)
        result
      rescue => e
        @current_span.add_event('build_error', { error: e.message })
        @trace.finish_span(@current_span, :error, e)
        raise e
      end
    end
    
    def trace_generation(generator_name, &block)
      span = @trace.start_span("generate_#{generator_name}", {
        generator: generator_name
      })
      
      begin
        result = block.call
        @trace.finish_span(span, :ok)
        result
      rescue => e
        span.add_event('generation_error', { error: e.message })
        @trace.finish_span(span, :error, e)
        raise e
      end
    end
  end
  
  # GitHub Actions workflow tracing
  class WorkflowTracer
    def initialize(trace_id, run_id)
      @trace = Trace.new(trace_id, {
        workflow_run_id: run_id,
        repository: ENV['GITHUB_REPOSITORY'],
        actor: ENV['GITHUB_ACTOR']
      })
    end
    
    def trace_job(job_name, &block)
      span = @trace.start_span("job_#{job_name}", {
        job: job_name,
        runner: ENV['RUNNER_NAME']
      })
      
      begin
        result = block.call
        @trace.finish_span(span, :ok)
        result
      rescue => e
        span.add_event('job_failed', { error: e.message })
        @trace.finish_span(span, :error, e)
        raise e
      end
    end
  end
  
  # Cloudflare Pages deployment tracing
  class DeploymentTracer
    def initialize(trace_id, deployment_id)
      @trace = Trace.new(trace_id, {
        deployment_id: deployment_id,
        project: ENV['CLOUDFLARE_PROJECT_NAME'],
        environment: ENV['CLOUDFLARE_ENVIRONMENT']
      })
    end
    
    def trace_stage(stage_name, &block)
      span = @trace.start_span("deployment_#{stage_name}", {
        stage: stage_name,
        timestamp: Time.now.utc.iso8601
      })
      
      begin
        result = block.call
        @trace.finish_span(span, :ok)
        result
      rescue => e
        span.add_event('stage_failed', {
          error: e.message,
          retry_attempt: @retry_count || 0
        })
        @trace.finish_span(span, :error, e)
        raise e
      end
    end
  end
end
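
Trace#send_to_collector relies on an Exporter class that is not defined above. A minimal sketch that POSTs the exported trace to the collector Worker shown further below (the class name comes from send_to_collector; the HTTP details are assumptions):

# lib/distributed_tracing/exporter.rb (illustrative sketch)
require 'net/http'
require 'uri'
require 'json'

module DistributedTracing
  class Exporter
    def initialize(collector_url)
      @uri = URI.parse(collector_url)
    end
    
    # Serialize the trace and post it to the collector's /api/traces endpoint
    def export(trace)
      request = Net::HTTP::Post.new('/api/traces')
      request['Content-Type'] = 'application/json'
      request.body = trace.export.to_json
      
      Net::HTTP.start(@uri.host, @uri.port, use_ssl: @uri.scheme == 'https') do |http|
        http.request(request)
      end
    end
  end
end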

# Integration with Jekyll
Jekyll::Hooks.register :site, :after_reset do |site|
  trace_id = ENV['TRACE_ID'] || SecureRandom.hex(16)
  tracer = DistributedTracing::JekyllTracer.new(
    DistributedTracing::Trace.new(trace_id, {
      site_config: site.config.keys,
      jekyll_version: Jekyll::VERSION
    }),
    site
  )
  
  site.data['_tracer'] = tracer
end
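
To actually ship the trace, a post_write hook can export it once the build finishes. This assumes a TRACE_COLLECTOR_URL environment variable pointing at the Worker below.

Jekyll::Hooks.register :site, :post_write do |site|
  tracer = site.data['_tracer']
  next unless tracer && ENV['TRACE_COLLECTOR_URL']
  
  # JekyllTracer exposes its underlying Trace via attr_reader :trace
  tracer.trace.send_to_collector(ENV['TRACE_COLLECTOR_URL'])
end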

# Worker for trace collection
// workers/trace-collector.js
export default {
  async fetch(request, env, ctx) {
    const url = new URL(request.url)
    
    if (url.pathname === '/api/traces' && request.method === 'POST') {
      return handleTraceSubmission(request, env, ctx)
    }
    
    return new Response('Not found', { status: 404 })
  }
}

async function handleTraceSubmission(request, env, ctx) {
  const trace = await request.json()
  
  // Validate trace
  if (!trace.trace_id || !trace.spans) {
    return new Response('Invalid trace data', { status: 400 })
  }
  
  // Store trace
  await storeTrace(trace, env)
  
  // Process for analytics
  await processTraceAnalytics(trace, env, ctx)
  
  return new Response(JSON.stringify({ received: true }))
}

async function storeTrace(trace, env) {
  const traceKey = `trace:${trace.trace_id}`
  
  // Store full trace
  await env.TRACES_KV.put(traceKey, JSON.stringify(trace), {
    metadata: {
      start_time: trace.start_time,
      duration: trace.duration,
      span_count: trace.spans.length
    }
  })
  
  // Index spans for querying
  for (const span of trace.spans) {
    const spanKey = `span:${trace.trace_id}:${span.span_id}`
    await env.SPANS_KV.put(spanKey, JSON.stringify(span))
    
    // Index by span name; include trace and span ids so entries don't overwrite each other
    const indexKey = `index:span_name:${span.name}:${trace.trace_id}:${span.span_id}`
    await env.SPANS_KV.put(indexKey, JSON.stringify({
      trace_id: trace.trace_id,
      span_id: span.span_id,
      start_time: span.start_time
    }))
  }
}

Intelligent Alerting and Incident Response

An intelligent alerting system categorizes issues, routes them appropriately, and provides context for quick resolution while avoiding alert fatigue.


# lib/alerting_system.rb
require 'securerandom'
require 'digest'
require 'time'

module AlertingSystem
  class AlertManager
    def initialize(config)
      @config = config
      @routing_rules = load_routing_rules
      @escalation_policies = load_escalation_policies
      @alert_history = AlertHistory.new
      @deduplicator = AlertDeduplicator.new
    end
    
    def create_alert(alert_data)
      # Deduplicate similar alerts
      fingerprint = @deduplicator.fingerprint(alert_data)
      
      if @deduplicator.recent_duplicate?(fingerprint)
        log("Duplicate alert suppressed: #{fingerprint}")
        return nil
      end
      
      # Create alert with context
      alert = Alert.new(alert_data.merge(fingerprint: fingerprint))
      
      # Determine routing
      route = determine_route(alert)
      
      # Apply escalation policy
      escalation = determine_escalation(alert)
      
      # Store alert
      @alert_history.record(alert)
      
      # Send notifications
      send_notifications(alert, route, escalation)
      
      alert
    end
    
    def resolve_alert(alert_id, resolution_data = {})
      alert = @alert_history.find(alert_id)
      
      if alert
        alert.resolve(resolution_data)
        @alert_history.update(alert)
        
        # Send resolution notifications
        send_resolution_notifications(alert)
      end
    end
    
    private
    
    def determine_route(alert)
      @routing_rules.find do |rule|
        rule.matches?(alert)
      end || default_route
    end
    
    def determine_escalation(alert)
      policy = @escalation_policies.find { |p| p.applies_to?(alert) }
      policy || default_escalation_policy
    end
    
    def send_notifications(alert, route, escalation)
      # Send to primary channels
      route.channels.each do |channel|
        send_to_channel(alert, channel)
      end
      
      # Schedule escalation if needed
      if escalation.enabled?
        schedule_escalation(alert, escalation)
      end
    end
    
    def send_to_channel(alert, channel)
      notifier = NotifierFactory.create(channel.type, channel.config)
      notifier.send(alert.formatted_for(channel.format))
    rescue => e
      log("Failed to send to #{channel.type}: #{e.message}")
    end
  end
  
  class Alert
    attr_reader :id, :fingerprint, :severity, :status, :created_at, :resolved_at
    attr_accessor :context, :assignee, :notes
    
    def initialize(data)
      @id = SecureRandom.uuid
      @fingerprint = data[:fingerprint]
      @title = data[:title]
      @description = data[:description]
      @severity = data[:severity] || :error
      @status = :open
      @context = data[:context] || {}
      @created_at = Time.now.utc
      @updated_at = @created_at
      @resolved_at = nil
      @assignee = nil
      @notes = []
      @notifications = []
    end
    
    def resolve(resolution_data = {})
      @status = :resolved
      @resolved_at = Time.now.utc
      @resolution = resolution_data[:resolution] || 'manual'
      @resolution_notes = resolution_data[:notes]
      @updated_at = @resolved_at
      
      add_note("Alert resolved: #{@resolution}")
    end
    
    def add_note(text, author = 'system')
      @notes << {
        text: text,
        author: author,
        timestamp: Time.now.utc.iso8601
      }
      @updated_at = Time.now.utc
    end
    
    def formatted_for(format = :slack)
      case format
      when :slack
        format_for_slack
      when :email
        format_for_email
      when :webhook
        format_for_webhook
      else
        to_h
      end
    end
    
    private
    
    def format_for_slack
      {
        text: "*#{@title}*",
        attachments: [
          {
            color: severity_color,
            fields: [
              {
                title: "Description",
                value: @description,
                short: false
              },
              {
                title: "Severity",
                value: @severity.to_s.upcase,
                short: true
              },
              {
                title: "Status",
                value: @status.to_s.upcase,
                short: true
              }
            ],
            ts: @created_at.to_i
          }
        ]
      }
    end
    
    def severity_color
      case @severity
      when :critical then "#FF0000"
      when :error then "#FF6B6B"
      when :warning then "#FFA726"
      when :info then "#42A5F5"
      else "#78909C"
      end
    end
  end
  
  class AlertDeduplicator
    def initialize(window_minutes = 5)
      @window = window_minutes * 60
      @recent_alerts = {}
    end
    
    def fingerprint(alert_data)
      # Create fingerprint from alert characteristics
      components = [
        alert_data[:title],
        alert_data[:severity],
        alert_data.dig(:context, :source),
        alert_data.dig(:context, :error_type)
      ].compact.map(&:to_s).join('|')
      
      Digest::SHA256.hexdigest(components)
    end
    
    def recent_duplicate?(fingerprint)
      now = Time.now.utc
      last_seen = @recent_alerts[fingerprint]
      
      # Always refresh the timestamp so the suppression window slides forward
      @recent_alerts[fingerprint] = now
      cleanup_old_alerts
      
      !last_seen.nil? && (now - last_seen) < @window
    end
    
    private
    
    def cleanup_old_alerts
      cutoff = Time.now.utc - @window
      @recent_alerts.delete_if { |_, timestamp| timestamp < cutoff }
    end
  end
  
  # Integration with deployment errors
  class DeploymentAlerting
    def initialize(alert_manager)
      @alert_manager = alert_manager
    end
    
    def handle_deployment_error(error, deployment_context = {})
      alert_data = {
        title: "Deployment Failed: #{deployment_context[:stage]}",
        description: error.message,
        severity: determine_severity(error),
        context: {
          source: 'deployment',
          stage: deployment_context[:stage],
          deployment_id: deployment_context[:id],
          environment: deployment_context[:environment],
          error_type: error.class.name,
          backtrace: error.backtrace&.first(5)
        }
      }
      
      @alert_manager.create_alert(alert_data)
    end
    
    def handle_build_error(error, build_context = {})
      alert_data = {
        title: "Build Failed: #{build_context[:component]}",
        description: error.message,
        severity: :error,
        context: {
          source: 'build',
          component: build_context[:component],
          file: build_context[:file],
          line: build_context[:line],
          jekyll_env: ENV['JEKYLL_ENV'],
          error_type: error.class.name
        }
      }
      
      @alert_manager.create_alert(alert_data)
    end
    
    private
    
    def determine_severity(error)
      case error
      when DeploymentErrorHandler::BuildError
        :critical
      when DeploymentErrorHandler::DeploymentError
        :error
      when Cloudflare::APIError
        error.message.include?('rate limit') ? :warning : :error
      else
        :error
      end
    end
  end
end
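
AlertManager references routing rules, escalation policies, an AlertHistory, and a NotifierFactory that are left undefined above. As a minimal sketch of the routing side (the Channel struct and constructor keywords are assumptions), a rule can simply match on severity and source and list the channels to notify:

# lib/alerting_system/routing_rule.rb (illustrative sketch)
module AlertingSystem
  Channel = Struct.new(:type, :config, :format)
  
  class RoutingRule
    attr_reader :channels
    
    def initialize(severities:, channels:, sources: nil)
      @severities = severities
      @channels = channels
      @sources = sources
    end
    
    # A rule matches when the alert's severity is listed and, if sources are
    # given, the alert's context source is one of them.
    def matches?(alert)
      @severities.include?(alert.severity) &&
        (@sources.nil? || @sources.include?(alert.context[:source]))
    end
  end
end

# Example: page on-call for critical deployment failures
critical_rule = AlertingSystem::RoutingRule.new(
  severities: [:critical],
  sources: ['deployment'],
  channels: [AlertingSystem::Channel.new(:pagerduty, { key: ENV['PAGERDUTY_KEY'] }, :webhook)]
)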

# Rake task for alert testing
namespace :alerts do
  desc 'Test alerting system'
  task :test do
    require_relative 'lib/alerting_system'
    
    alert_manager = AlertingSystem::AlertManager.new(
      config_file: 'config/alerting.yml'
    )
    
    # Test critical alert
    alert_manager.create_alert(
      title: 'Test Critical Alert',
      description: 'This is a test of the alerting system',
      severity: :critical,
      context: {
        source: 'test',
        test_id: '12345'
      }
    )
    
    puts 'Test alert sent successfully'
  end
end

This comprehensive error handling and monitoring system provides enterprise-grade observability for Jekyll deployments. By combining Ruby's error handling capabilities with Cloudflare's monitoring tools and GitHub Actions' workflow tracking, you can achieve rapid detection, diagnosis, and resolution of deployment issues while maintaining high reliability and performance.