DevOps & Platform Engineering

CI/CD pipelines, infrastructure as code, monitoring, and operational practices.

Scope

CI/CD

  • Build automation
  • Test automation
  • Deployment pipelines
  • Release management
  • Feature flags

Infrastructure as Code

  • AWS CDK / CloudFormation
  • Terraform
  • Environment management
  • Configuration management

Observability

  • Logging
  • Metrics
  • Tracing
  • Alerting
  • Dashboards

Operations

  • Incident management
  • On-call procedures
  • Change management
  • Capacity planning

Research Topics

  • AWS CDK vs Terraform
  • GitHub Actions vs AWS CodePipeline
  • Blue/green vs canary deployments
  • Feature flag platforms
  • Observability stack (Datadog, New Relic)
  • Incident management tools
  • SRE practices
  • Platform engineering patterns

Architecture Considerations

CI/CD Pipeline

┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
│  Code   │───►│  Build  │───►│  Test   │───►│ Deploy  │
│  Push   │    │  + Lint │    │  Suite  │    │  Stage  │
└─────────┘    └─────────┘    └─────────┘    └────┬────┘
                                                   │
                                          ┌────────▼────────┐
                                          │ Integration Test │
                                          └────────┬────────┘
                                                   │
                                          ┌────────▼────────┐
                                          │    Approval     │
                                          └────────┬────────┘
                                                   │
                                          ┌────────▼────────┐
                                          │   Deploy Prod   │
                                          └─────────────────┘

GitHub Actions Workflow

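name: Deploy
on:
  push:
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20  # matches the NODEJS_20_X Lambda runtime used below
      - run: npm ci
      - run: npm run lint
      - run: npm run test
      - run: npm run build

  deploy-staging:
    needs: build
    runs-on: ubuntu-latest
    environment: staging
    steps:
      # Each job starts on a fresh runner, so check out and install again.
      # AWS credentials must be configured before the deploy step (not shown).
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx cdk deploy --require-approval never

  integration-test:
    needs: deploy-staging
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm run test:integration

  deploy-prod:
    needs: integration-test
    runs-on: ubuntu-latest
    # The Approval stage in the diagram maps to required reviewers
    # configured on the "production" environment in the repository settings.
    environment: production
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      # Which account and stack this targets depends on per-environment
      # credentials and CDK context (not shown).
      - run: npx cdk deploy --require-approval never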

Infrastructure as Code (CDK)

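// Example CDK stack (AWS CDK v2)
import * as cdk from 'aws-cdk-lib';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { Construct } from 'constructs';

export class BookingStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // DynamoDB table: single-table design with generic PK/SK keys,
    // on-demand billing, and point-in-time recovery for backups
    const bookingsTable = new dynamodb.Table(this, 'Bookings', {
      partitionKey: { name: 'PK', type: dynamodb.AttributeType.STRING },
      sortKey: { name: 'SK', type: dynamodb.AttributeType.STRING },
      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
      pointInTimeRecovery: true
    });

    // Lambda function: code is bundled from the local 'lambda' directory
    const bookingHandler = new lambda.Function(this, 'BookingHandler', {
      runtime: lambda.Runtime.NODEJS_20_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('lambda'),
      environment: {
        // The handler reads the table name from its environment at runtime
        TABLE_NAME: bookingsTable.tableName
      }
    });

    // Grant the function's role read/write access to the table
    bookingsTable.grantReadWriteData(bookingHandler);
  }
}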

Observability

Logging Strategy

Log Levels:
├── ERROR: System errors, failures
├── WARN: Degraded performance, retries
├── INFO: Business events, transactions
└── DEBUG: Detailed troubleshooting (dev only)

Structured Logging:
{
  "timestamp": "2024-06-15T10:30:00Z",
  "level": "INFO",
  "service": "booking-api",
  "traceId": "abc123",
  "event": "booking.created",
  "bookingRef": "XYZ789",
  "duration": 245
}
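
A minimal sketch of a logger that emits entries in the shape shown above and filters by level. The names createLogger, sink, and now are hypothetical helpers introduced for illustration; a real service would more likely use an existing logging library.

```typescript
type LogLevel = "ERROR" | "WARN" | "INFO" | "DEBUG";

// Lower number = more severe. An entry is emitted only if it is at least
// as severe as the configured minimum level.
const LEVEL_ORDER: Record<LogLevel, number> = { ERROR: 0, WARN: 1, INFO: 2, DEBUG: 3 };

function createLogger(
  service: string,
  minLevel: LogLevel,
  sink: (line: string) => void = console.log, // assumption: one JSON object per line
  now: () => Date = () => new Date(),
) {
  return function log(
    level: LogLevel,
    event: string,
    traceId: string,
    fields: Record<string, unknown> = {},
  ): boolean {
    if (LEVEL_ORDER[level] > LEVEL_ORDER[minLevel]) {
      return false; // e.g. DEBUG is dropped when minLevel is INFO
    }
    // Core fields are written last so caller-supplied fields cannot override them.
    const entry = {
      ...fields,
      timestamp: now().toISOString(),
      level,
      service,
      traceId,
      event,
    };
    sink(JSON.stringify(entry));
    return true;
  };
}
```

Setting minLevel to "DEBUG" in development and "INFO" elsewhere would match the "dev only" note on the DEBUG level.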

Metrics

Key Metrics:
├── Business
│   ├── bookings_created_total
│   ├── revenue_total
│   └── conversion_rate
├── Technical
│   ├── request_duration_seconds
│   ├── error_rate
│   └── concurrent_users
└── Infrastructure
    ├── lambda_invocations
    ├── dynamodb_consumed_capacity
    └── api_gateway_latency
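
Several of these metrics are derived rather than counted directly. The sketch below shows one way to compute error_rate, conversion_rate, and a latency percentile from raw values. The definitions are assumptions (error rate as errors over requests, conversion rate as bookings over sessions, nearest-rank percentile); the document does not specify them.

```typescript
interface WindowCounts {
  requests: number;        // total requests in the window
  errors: number;          // failed requests in the window
  sessions: number;        // assumption: denominator for conversion
  bookingsCreated: number; // increments of bookings_created_total in the window
}

// Avoid division by zero for empty windows.
function ratio(numerator: number, denominator: number): number {
  return denominator === 0 ? 0 : numerator / denominator;
}

function deriveRates(c: WindowCounts): { errorRate: number; conversionRate: number } {
  return {
    errorRate: ratio(c.errors, c.requests),
    conversionRate: ratio(c.bookingsCreated, c.sessions),
  };
}

// Nearest-rank percentile over request_duration_seconds samples.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}
```

In practice a metrics backend usually computes percentiles from histograms; this version is only meant to make the definitions concrete.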

Distributed Tracing

X-Ray Trace:
Request → API Gateway → Lambda → DynamoDB
            │
            └── Lambda → Payment Gateway
                    │
                    └── Lambda → Notification
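
For the trace above to stay connected across hops, each service has to forward the trace context. X-Ray carries it in the X-Amzn-Trace-Id header as semicolon-separated Root, Parent, and Sampled fields. The sketch below parses that header and builds the value to send downstream; the function names are hypothetical, and the X-Ray SDK normally does this automatically.

```typescript
interface TraceContext {
  root: string;               // trace ID shared by every segment in the request
  parent: string | undefined; // ID of the calling segment, if any
  sampled: boolean;
}

// Parse a value such as "Root=1-...;Parent=...;Sampled=1".
function parseTraceHeader(header: string): TraceContext | undefined {
  const fields: Record<string, string> = {};
  for (const part of header.split(";")) {
    const idx = part.indexOf("=");
    if (idx > 0) {
      fields[part.slice(0, idx).trim()] = part.slice(idx + 1).trim();
    }
  }
  const root = fields["Root"];
  if (root === undefined || root === "") return undefined;
  return { root, parent: fields["Parent"], sampled: fields["Sampled"] === "1" };
}

// Build the header for a downstream call: same root, this segment as parent.
function childTraceHeader(ctx: TraceContext, segmentId: string): string {
  return `Root=${ctx.root};Parent=${segmentId};Sampled=${ctx.sampled ? "1" : "0"}`;
}
```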

Deployment Strategies

Blue/Green

Production (Blue) ←── Traffic
                      │
Staging (Green)       │ Switch
                      │
                      ▼
Production (Green) ←── Traffic (after verification)
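
The switch in the diagram can be modeled as a swap of two roles. This is a minimal sketch with hypothetical names (BlueGreenState, switchTraffic, rollback); in AWS the swap itself would be done by a load balancer, DNS weight, or Lambda alias change rather than in application code.

```typescript
type Color = "blue" | "green";

interface BlueGreenState {
  live: Color; // environment currently receiving production traffic
  idle: Color; // environment holding the other version
}

// Swap live and idle only when the idle environment has passed verification.
function switchTraffic(state: BlueGreenState, idleVerified: boolean): BlueGreenState {
  if (!idleVerified) return state; // keep traffic where it is
  return { live: state.idle, idle: state.live };
}

// Rolling back is the same swap in reverse, because the previous
// version is still running in the idle environment.
function rollback(state: BlueGreenState): BlueGreenState {
  return { live: state.idle, idle: state.live };
}
```

The fast rollback is the main advantage of this strategy; the cost is running two full environments during the release.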

Canary

Version 1 ←── 95% Traffic
Version 2 ←── 5% Traffic → Monitor → Increase gradually
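
A sketch of how the traffic split and the "increase gradually" step might work. Users are assigned to a stable bucket from 0 to 99 by hashing their ID, so a user who lands on Version 2 at 5% stays there as the percentage grows. The hash (an FNV-1a style hash over UTF-16 code units) and the promotion steps 5, 25, 50, 100 are assumptions for illustration, not values from this document.

```typescript
// FNV-1a style 32-bit hash; any stable, well-distributed hash would do.
function hash32(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned
}

// Stable bucket in [0, 99] for a given user.
function bucketOf(userId: string): number {
  return hash32(userId) % 100;
}

// Users whose bucket falls below the canary percentage get Version 2.
function routeRequest(userId: string, canaryPercent: number): "v1" | "v2" {
  return bucketOf(userId) < canaryPercent ? "v2" : "v1";
}

// Move to the next step while monitoring is healthy; drop to 0 otherwise.
function nextCanaryPercent(
  current: number,
  healthy: boolean,
  steps: number[] = [5, 25, 50, 100],
): number {
  if (!healthy) return 0; // roll the canary back
  const next = steps.find((s) => s > current);
  return next === undefined ? 100 : next;
}
```

Per-user bucketing approximates the percentage split over many users rather than enforcing it per request, which is usually acceptable and keeps each user's experience consistent.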

Feature Flags

// LaunchDarkly / AWS AppConfig
const showNewCheckout = await featureFlags.variation(
  'new-checkout-flow',
  { userId: user.id, tier: user.loyaltyTier },
  false // default
);

if (showNewCheckout) {
  // New flow
} else {
  // Old flow
}
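
To make the call above concrete, here is a minimal in-memory stand-in for the flag client. It is not the LaunchDarkly or AppConfig API: InMemoryFlags, FlagRule, and the tier-based targeting rule are hypothetical, chosen to match the call shape used in the snippet and to be usable in unit tests.

```typescript
interface FlagContext {
  userId: string;
  tier?: string;
}

// Assumption: a flag is on for users whose loyalty tier is listed.
interface FlagRule {
  enabledTiers: string[];
}

class InMemoryFlags {
  private readonly rules: Record<string, FlagRule>;

  constructor(rules: Record<string, FlagRule>) {
    this.rules = rules;
  }

  // Synchronous evaluation: unknown flags fall back to the caller's default.
  evaluate(key: string, context: FlagContext, defaultValue: boolean): boolean {
    const rule = this.rules[key];
    if (rule === undefined) return defaultValue;
    return context.tier !== undefined && rule.enabledTiers.includes(context.tier);
  }

  // Async wrapper, since hosted flag services are typically awaited.
  async variation(key: string, context: FlagContext, defaultValue: boolean): Promise<boolean> {
    return this.evaluate(key, context, defaultValue);
  }
}
```

Passing an explicit default, as the original call does, means an unreachable or misconfigured flag service falls back to the old flow instead of failing the request.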

Incident Management

Severity Levels

Level  Description  Response Time  Examples
P1     Critical     15 min         Booking down, payment failures
P2     High         1 hour         Degraded performance
P3     Medium       4 hours        Non-critical feature issue
P4     Low          24 hours       Minor bug
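
The response times above can be encoded directly so that alerting or reporting code can check them. A minimal sketch; the function names are hypothetical, and it assumes the response clock starts at detection time.

```typescript
type Severity = "P1" | "P2" | "P3" | "P4";

// Response targets from the severity table, in minutes.
const RESPONSE_MINUTES: Record<Severity, number> = {
  P1: 15,
  P2: 60,
  P3: 4 * 60,
  P4: 24 * 60,
};

// Latest time by which someone should have responded.
function responseDeadline(severity: Severity, detectedAt: Date): Date {
  return new Date(detectedAt.getTime() + RESPONSE_MINUTES[severity] * 60 * 1000);
}

// True when the target has been missed as of `now`.
function isResponseBreached(severity: Severity, detectedAt: Date, now: Date): boolean {
  return now.getTime() > responseDeadline(severity, detectedAt).getTime();
}
```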

Incident Process

Detection → Alert → Acknowledge → Investigate → Mitigate → Resolve → Postmortem
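
If incidents are tracked in code, the process above can be treated as a linear state machine. The sketch below assumes stages may not be skipped, which is stricter than the diagram states; the names nextStage and advance are hypothetical.

```typescript
const STAGES = [
  "Detection",
  "Alert",
  "Acknowledge",
  "Investigate",
  "Mitigate",
  "Resolve",
  "Postmortem",
] as const;

type Stage = (typeof STAGES)[number];

// The stage that follows `current`, or undefined after Postmortem.
function nextStage(current: Stage): Stage | undefined {
  return STAGES[STAGES.indexOf(current) + 1];
}

// Move an incident forward by exactly one stage; reject anything else.
function advance(current: Stage, target: Stage): Stage {
  if (nextStage(current) !== target) {
    throw new Error(`Cannot move from ${current} to ${target}`);
  }
  return target;
}
```

Recording a timestamp at each transition would also give time-to-acknowledge and time-to-resolve figures for the postmortem.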

Runbooks

Runbook: High Error Rate
1. Check error logs in CloudWatch
2. Identify affected service
3. Check recent deployments
4. Roll back if deployment-related
5. Scale up if capacity-related
6. Engage on-call engineer
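
Steps 3 to 6 of this runbook are a small decision procedure, which can be sketched as a function that turns investigation findings into an ordered list of actions. The type and field names are hypothetical, and the sketch assumes the on-call engineer is always engaged as the last step.

```typescript
type RunbookAction = "roll-back-deployment" | "scale-up" | "engage-on-call";

// Findings gathered in steps 1 to 3 of the runbook.
interface ErrorRateFindings {
  affectedService: string;
  recentDeployment: boolean;  // a deployment preceded the error spike
  capacitySaturated: boolean; // the service is at or near its limits
}

function recommendActions(findings: ErrorRateFindings): RunbookAction[] {
  const actions: RunbookAction[] = [];
  if (findings.recentDeployment) actions.push("roll-back-deployment"); // step 4
  if (findings.capacitySaturated) actions.push("scale-up");            // step 5
  actions.push("engage-on-call");                                       // step 6
  return actions;
}
```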

Environments

Environment  Purpose              Data
Development  Local dev            Mock
Integration  Service testing      Synthetic
Staging      Pre-prod validation  Anonymized prod
Production   Live system          Real
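
The environment table can double as typed configuration, which lets tooling refuse to load the wrong class of data into an environment. A minimal sketch with hypothetical names; it assumes each environment accepts exactly the data class listed for it.

```typescript
type EnvironmentName = "development" | "integration" | "staging" | "production";
type DataClass = "mock" | "synthetic" | "anonymized-prod" | "real";

interface EnvironmentSpec {
  purpose: string;
  data: DataClass;
}

// Mirrors the environment table row by row.
const ENVIRONMENTS: Record<EnvironmentName, EnvironmentSpec> = {
  development: { purpose: "Local dev", data: "mock" },
  integration: { purpose: "Service testing", data: "synthetic" },
  staging: { purpose: "Pre-prod validation", data: "anonymized-prod" },
  production: { purpose: "Live system", data: "real" },
};

// Guard for data-loading scripts: real data is accepted only in production.
function isDataAllowed(env: EnvironmentName, data: DataClass): boolean {
  return ENVIRONMENTS[env].data === data;
}
```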

Tools

CI/CD

Tool              Use
GitHub Actions    Primary CI/CD
AWS CodePipeline  AWS deployments
ArgoCD            Kubernetes (if used)

Monitoring

Tool        Use
CloudWatch  AWS native
Datadog     APM, logs, metrics
PagerDuty   Alerting, on-call

IaC

Tool       Use
AWS CDK    AWS infrastructure
Terraform  Multi-cloud option