AWS with Terraform (Day 23)
End-to-End Observability on AWS Using Terraform
A Real-World Serverless Project (Day 23 Completed)
As a DevOps engineer, I’ve learned that building applications is only half the job. The real challenge begins when those applications are running in production. If you can’t observe your system—logs, metrics, alerts, failures—you’re flying blind.
On Day 23 of my hands-on DevOps journey, I completed an end-to-end observability stack on AWS using Terraform for a real-world serverless image-processing application.
This project focuses on building production-grade monitoring, logging, dashboards, alarms, and notifications, all fully automated and reproducible.
This blog walks through what I built, why I built it, and how it works in practice.
Project Overview: Serverless Image Processing Pipeline
At the core of this project is a simple but realistic serverless workflow:
1. A user uploads an image to an S3 upload bucket
2. An AWS Lambda function is triggered
3. The function processes the image into multiple formats:
   - WEBP
   - JPG
   - PNG
   - Thumbnail
   - Additional resized variants
4. Processed images are stored in a separate S3 processed bucket
5. Observability is layered on top using CloudWatch and SNS
This mirrors real-world serverless workloads where visibility, performance, and error detection are critical.
Why Build Observability the Terraform Way
I deliberately implemented observability using Terraform (Infrastructure as Code) instead of manual console configuration.
Why?
- Reproducibility – The entire monitoring stack can be recreated in minutes
- Auditability – Dashboards, alarms, and thresholds are version-controlled
- No configuration drift – No hidden console changes
- Environment parity – Same observability for dev, staging, and prod
- Production realism – This is how mature teams manage observability
Observability should not be an afterthought—it should be part of the infrastructure itself.
Terraform Project Structure (Production-Style)
The project is organized using reusable Terraform modules, closely resembling real production repositories.
Key Modules
1. SNS Notifications
- Three SNS topics:
  - Critical alerts
  - Performance alerts
  - Log-based alerts
- Email subscriptions (SMS optional)
- SNS topic policies allowing CloudWatch to publish events
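As a minimal sketch, one of these topics might be declared roughly as follows. The resource names, topic name, and the alert_email variable are placeholders for illustration, not the exact code from the repository:

```hcl
# Hypothetical names; a minimal sketch of one alert topic.
resource "aws_sns_topic" "critical_alerts" {
  name = "image-pipeline-critical-alerts"
}

# Email subscription; the recipient must confirm it before alerts are delivered.
resource "aws_sns_topic_subscription" "critical_email" {
  topic_arn = aws_sns_topic.critical_alerts.arn
  protocol  = "email"
  endpoint  = var.alert_email
}

# Allow CloudWatch alarms to publish to the topic.
data "aws_iam_policy_document" "cloudwatch_publish" {
  statement {
    effect    = "Allow"
    actions   = ["sns:Publish"]
    resources = [aws_sns_topic.critical_alerts.arn]

    principals {
      type        = "Service"
      identifiers = ["cloudwatch.amazonaws.com"]
    }
  }
}

resource "aws_sns_topic_policy" "allow_cloudwatch" {
  arn    = aws_sns_topic.critical_alerts.arn
  policy = data.aws_iam_policy_document.cloudwatch_publish.json
}
```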
2. S3 Buckets
- Separate upload and processed buckets
- Versioning enabled
- Server-side encryption (AES-256)
- Public access completely blocked
Security and durability are non-negotiable in production.
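A minimal sketch of how one of these buckets could be declared, assuming hypothetical resource and variable names (project_name, environment):

```hcl
# Hypothetical names; a minimal sketch of the processed-images bucket.
resource "aws_s3_bucket" "processed" {
  bucket = "${var.project_name}-processed-${var.environment}"
}

# Keep every object version for durability and rollback.
resource "aws_s3_bucket_versioning" "processed" {
  bucket = aws_s3_bucket.processed.id
  versioning_configuration {
    status = "Enabled"
  }
}

# Server-side encryption with AES-256.
resource "aws_s3_bucket_server_side_encryption_configuration" "processed" {
  bucket = aws_s3_bucket.processed.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

# Block all public access.
resource "aws_s3_bucket_public_access_block" "processed" {
  bucket                  = aws_s3_bucket.processed.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```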
3. Lambda Function (Image Processor)
- IAM role with least-privilege permissions:
  - CloudWatch Logs
  - S3 GetObject, PutObject, and version access
  - CloudWatch custom metric publishing
- S3 event notifications wired to Lambda
- Explicit Lambda permissions for S3 invocation
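The wiring between the upload bucket and the function comes down to two resources. The sketch below uses assumed names (aws_s3_bucket.upload, aws_lambda_function.image_processor); the real module may differ:

```hcl
# Grant the S3 service permission to invoke the function.
resource "aws_lambda_permission" "allow_s3" {
  statement_id  = "AllowS3Invoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.image_processor.function_name
  principal     = "s3.amazonaws.com"
  source_arn    = aws_s3_bucket.upload.arn
}

# Trigger the function on every object created in the upload bucket.
resource "aws_s3_bucket_notification" "upload_events" {
  bucket = aws_s3_bucket.upload.id

  lambda_function {
    lambda_function_arn = aws_lambda_function.image_processor.arn
    events              = ["s3:ObjectCreated:*"]
  }

  depends_on = [aws_lambda_permission.allow_s3]
}
```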
4. Lambda Layer (Pillow + Dependencies)
- Python dependencies packaged as a Lambda Layer
- Built using Docker to ensure Linux runtime compatibility
- Avoids local OS and architecture issues
5. CloudWatch Metric Filters
- Converts log events into custom metrics
- Enables application-level observability beyond default Lambda metrics
6. Dashboards and Alarms
- CloudWatch dashboard with multiple widgets
- Alarms on both standard and custom metrics
- SNS integration for automated alerting
Building the Lambda Layer (Docker-Based)
Native Python dependencies like Pillow can break if built on the wrong OS or architecture. To avoid this, the project builds the layer inside Docker.
Flow:
- Docker runs a Python 3.12 Linux container
- Dependencies are installed inside the container
- The output is a pillow-layer.zip archive
- Terraform uploads the zip as a Lambda Layer
This approach ensures 100% compatibility with the Lambda runtime and avoids painful production failures.
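Once the zip exists, publishing it as a layer is a single resource. A sketch, assuming the archive sits in the module directory and using an illustrative layer name:

```hcl
# Publish the Docker-built archive as a Lambda Layer.
resource "aws_lambda_layer_version" "pillow" {
  layer_name          = "pillow-dependencies"
  filename            = "${path.module}/pillow-layer.zip"
  source_code_hash    = filebase64sha256("${path.module}/pillow-layer.zip")
  compatible_runtimes = ["python3.12"]
}
```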
Custom Metrics Using Log Metric Filters
Default Lambda metrics are useful—but not enough.
This project adds application-aware observability using custom metrics derived from logs.
Key Custom Metrics
- Image processing errors
  - Triggered when the log level is ERROR
- Processing time
  - Extracted from structured logs
  - Enables average and max latency tracking
- Successful image processing count
- Original image size
  - Helps detect unusually large uploads
- S3 access denied events
  - Surfaced via CloudTrail → CloudWatch Logs
Each metric filter transforms log patterns into numeric CloudWatch metrics, enabling alarms and dashboards.
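For example, the error counter could be expressed roughly like this; the filter pattern, metric name, and namespace are illustrative and depend on the function's actual log format:

```hcl
# Count every log line containing ERROR as one unit of a custom metric.
resource "aws_cloudwatch_log_metric_filter" "processing_errors" {
  name           = "image-processing-errors"
  log_group_name = "/aws/lambda/${aws_lambda_function.image_processor.function_name}"
  pattern        = "ERROR"

  metric_transformation {
    name      = "ImageProcessingErrors"
    namespace = "ImagePipeline"
    value     = "1"
  }
}
```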
CloudWatch Dashboards (Operational Visibility)
The dashboard provides a single-pane-of-glass view of the system:
Included Widgets
- Lambda invocations, errors, and throttles
- Duration metrics (average, max, P99)
- Concurrent executions
- Custom image processing time
- Custom image size trends
- Recent error logs (CloudWatch Logs Insights)
This allows operators to understand health, performance, and failures at a glance.
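In Terraform, the dashboard body is just JSON. A trimmed sketch with a single widget follows; the resource names, layout values, and widget title are illustrative only:

```hcl
resource "aws_cloudwatch_dashboard" "pipeline" {
  dashboard_name = "image-pipeline-${var.environment}"

  dashboard_body = jsonencode({
    widgets = [
      {
        # One metric widget plotting invocations and errors side by side.
        type   = "metric"
        x      = 0
        y      = 0
        width  = 12
        height = 6
        properties = {
          title  = "Lambda Invocations and Errors"
          region = var.aws_region
          stat   = "Sum"
          period = 300
          metrics = [
            ["AWS/Lambda", "Invocations", "FunctionName", aws_lambda_function.image_processor.function_name],
            ["AWS/Lambda", "Errors", "FunctionName", aws_lambda_function.image_processor.function_name]
          ]
        }
      }
    ]
  })
}
```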
Alarms, SNS Topics, and Notification Flow
Alerts are categorized by severity:
🔴 Critical Alerts
- Repeated image processing failures
- Unauthorized access attempts
- Sent to the critical SNS topic (email/SMS)

🟠 Performance Alerts
- High latency (P99)
- Excessive concurrent executions

🟡 Log-Based Alerts
- Access denied events
- Anomalous log patterns
Each SNS topic is explicitly permitted to receive CloudWatch alarm events, and subscriptions must be confirmed via email.
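Tying a custom metric to a topic looks roughly like this; the threshold, names, and namespace are placeholders rather than the project's exact values:

```hcl
# Alert the critical topic when processing errors repeat within five minutes.
resource "aws_cloudwatch_metric_alarm" "processing_errors" {
  alarm_name          = "image-processing-errors-critical"
  namespace           = "ImagePipeline"
  metric_name         = "ImageProcessingErrors"
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 1
  threshold           = 3
  comparison_operator = "GreaterThanOrEqualToThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [aws_sns_topic.critical_alerts.arn]
}
```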
Practical Deployment Steps
1. Install prerequisites:
   - Terraform
   - AWS CLI (configured)
   - Docker
2. Build the Lambda layer (the Docker-based build described earlier)
3. Configure variables (see the sketch after this list):
   - Region
   - Environment
   - Project name
   - Alert email addresses
4. Initialize Terraform: terraform init
5. Plan and apply: terraform plan, then terraform apply
6. Confirm SNS subscriptions via email
7. Test by uploading images
8. Destroy resources when done: terraform destroy
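The variable wiring might look like the following; the variable names and defaults are assumptions for illustration:

```hcl
# variables.tf (excerpt) -- hypothetical names matching the sketches above.
variable "aws_region" {
  type    = string
  default = "us-east-1"
}

variable "environment" {
  type    = string
  default = "dev"
}

variable "project_name" {
  type = string
}

variable "alert_email" {
  type = string
}
```

Values can then be supplied through a terraform.tfvars file or -var flags at plan and apply time.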
Testing Common Alarm Scenarios
I validated the observability stack by intentionally triggering failures:
- Uploading valid images → success metrics only
- Uploading non-image files → error metrics and alarms
- Concurrent uploads → concurrency alarms
- Oversized files → size-based alerts
- Unauthorized access simulations → security alerts
This confirms the system fails loudly and visibly, which is exactly what production observability should do.
Troubleshooting Lessons Learned
- Backend errors usually mean S3 permissions or naming issues
- Docker-based layer builds are far more reliable than local builds
- CloudWatch metrics can take a few minutes to appear
- Metric filter patterns must exactly match the log structure
- Unconfirmed SNS subscriptions silently drop alerts
These are real-world lessons, not textbook theory.
Conclusion
This project reinforced a critical DevOps principle:
If you can’t observe it, you can’t operate it.
By implementing observability programmatically with Terraform, monitoring becomes a first-class citizen of infrastructure—not an afterthought.
Logs, metrics, dashboards, alarms, and notifications together create a proactive safety net for serverless applications.
Here is the session link: