AWS with Terraform (Day 23)
End-to-End Observability on AWS Using Terraform
A Real-World Serverless Project (Day 23 Completed)
As a DevOps engineer, I’ve learned that building applications is only half the job. The real challenge begins when those applications are running in production. If you can’t observe your system—logs, metrics, alerts, failures—you’re flying blind.
On Day 23 of my hands-on DevOps journey, I completed an end-to-end observability stack on AWS using Terraform for a real-world serverless image-processing application.
This project focuses on building production-grade monitoring, logging, dashboards, alarms, and notifications, all fully automated and reproducible.
This blog walks through what I built, why I built it, and how it works in practice.
Project Overview: Serverless Image Processing Pipeline
At the core of this project is a simple but realistic serverless workflow:
1. A user uploads an image to an S3 upload bucket
2. An AWS Lambda function is triggered
3. The function processes the image into multiple formats:
   - WEBP
   - JPG
   - PNG
   - Thumbnail
   - Additional resized variants
4. Processed images are stored in a separate S3 processed bucket
5. Observability is layered on top using CloudWatch and SNS
This mirrors real-world serverless workloads where visibility, performance, and error detection are critical.
Why Build Observability the Terraform Way
I deliberately implemented observability using Terraform (Infrastructure as Code) instead of manual console configuration.
Why?
- Reproducibility – The entire monitoring stack can be recreated in minutes
- Auditability – Dashboards, alarms, and thresholds are version-controlled
- No configuration drift – No hidden console changes
- Environment parity – Same observability for dev, staging, and prod
- Production realism – This is how mature teams manage observability
Observability should not be an afterthought—it should be part of the infrastructure itself.
Terraform Project Structure (Production-Style)
The project is organized using reusable Terraform modules, closely resembling real production repositories.
Key Modules
1. SNS Notifications
- Three SNS topics:
  - Critical alerts
  - Performance alerts
  - Log-based alerts
- Email subscriptions (SMS optional)
- SNS topic policies allowing CloudWatch to publish events
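As a minimal sketch, one of these topics might be declared roughly as follows. The resource names, topic name, and the alert_email variable are placeholders for illustration, not the exact code from the repository:

```hcl
# Hypothetical names; a minimal sketch of one alert topic.
resource "aws_sns_topic" "critical_alerts" {
  name = "image-pipeline-critical-alerts"
}

# Email subscription; the recipient must confirm it before alerts are delivered.
resource "aws_sns_topic_subscription" "critical_email" {
  topic_arn = aws_sns_topic.critical_alerts.arn
  protocol  = "email"
  endpoint  = var.alert_email
}

# Allow CloudWatch alarms to publish to the topic.
data "aws_iam_policy_document" "cloudwatch_publish" {
  statement {
    effect    = "Allow"
    actions   = ["sns:Publish"]
    resources = [aws_sns_topic.critical_alerts.arn]

    principals {
      type        = "Service"
      identifiers = ["cloudwatch.amazonaws.com"]
    }
  }
}

resource "aws_sns_topic_policy" "allow_cloudwatch" {
  arn    = aws_sns_topic.critical_alerts.arn
  policy = data.aws_iam_policy_document.cloudwatch_publish.json
}
```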
2. S3 Buckets
- Separate upload and processed buckets
- Versioning enabled
- Server-side encryption (AES-256)
- Public access completely blocked
Security and durability are non-negotiable in production.
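A minimal sketch of how one of these buckets could be declared, assuming hypothetical resource and variable names (project_name, environment):

```hcl
# Hypothetical names; a minimal sketch of the processed-images bucket.
resource "aws_s3_bucket" "processed" {
  bucket = "${var.project_name}-processed-${var.environment}"
}

# Keep every object version for durability and rollback.
resource "aws_s3_bucket_versioning" "processed" {
  bucket = aws_s3_bucket.processed.id
  versioning_configuration {
    status = "Enabled"
  }
}

# Server-side encryption with AES-256.
resource "aws_s3_bucket_server_side_encryption_configuration" "processed" {
  bucket = aws_s3_bucket.processed.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

# Block all public access.
resource "aws_s3_bucket_public_access_block" "processed" {
  bucket                  = aws_s3_bucket.processed.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```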
3. Lambda Function (Image Processor)
- IAM role with least-privilege permissions:
  - CloudWatch Logs
  - S3 GetObject, PutObject, and version access
  - CloudWatch custom metric publishing
- S3 event notifications wired to Lambda
- Explicit Lambda permissions for S3 invocation
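The wiring between the upload bucket and the function comes down to two resources. The sketch below uses assumed names (aws_s3_bucket.upload, aws_lambda_function.image_processor); the real module may differ:

```hcl
# Grant the S3 service permission to invoke the function.
resource "aws_lambda_permission" "allow_s3" {
  statement_id  = "AllowS3Invoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.image_processor.function_name
  principal     = "s3.amazonaws.com"
  source_arn    = aws_s3_bucket.upload.arn
}

# Trigger the function on every object created in the upload bucket.
resource "aws_s3_bucket_notification" "upload_events" {
  bucket = aws_s3_bucket.upload.id

  lambda_function {
    lambda_function_arn = aws_lambda_function.image_processor.arn
    events              = ["s3:ObjectCreated:*"]
  }

  depends_on = [aws_lambda_permission.allow_s3]
}
```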
4. Lambda Layer (Pillow + Dependencies)
- Python dependencies packaged as a Lambda Layer
- Built using Docker to ensure Linux runtime compatibility
- Avoids local OS and architecture issues
5. CloudWatch Metric Filters
- Converts log events into custom metrics
- Enables application-level observability beyond default Lambda metrics
6. Dashboards and Alarms
- CloudWatch dashboard with multiple widgets
- Alarms on both standard and custom metrics
- SNS integration for automated alerting
Building the Lambda Layer (Docker-Based)
Native Python dependencies like Pillow can break if built on the wrong OS or architecture. To avoid this, the project builds the layer inside Docker.
Flow:
- Docker runs a Python 3.12 Linux container
- Dependencies are installed inside the container
- The output is a pillow-layer.zip archive
- Terraform uploads the zip as a Lambda Layer
This approach ensures 100% compatibility with the Lambda runtime and avoids painful production failures.
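Once the zip exists, publishing it as a layer is a single resource. A sketch, assuming the archive sits in the module directory and using an illustrative layer name:

```hcl
# Publish the Docker-built archive as a Lambda Layer.
resource "aws_lambda_layer_version" "pillow" {
  layer_name          = "pillow-dependencies"
  filename            = "${path.module}/pillow-layer.zip"
  source_code_hash    = filebase64sha256("${path.module}/pillow-layer.zip")
  compatible_runtimes = ["python3.12"]
}
```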
Custom Metrics Using Log Metric Filters
Default Lambda metrics are useful—but not enough.
This project adds application-aware observability using custom metrics derived from logs.
Key Custom Metrics
- Image processing errors
  - Triggered when the log level is ERROR
- Processing time
  - Extracted from structured logs
  - Enables average and max latency tracking
- Successful image processing count
- Original image size
  - Helps detect unusually large uploads
- S3 access denied events
  - Surfaced via CloudTrail → CloudWatch Logs
Each metric filter transforms log patterns into numeric CloudWatch metrics, enabling alarms and dashboards.
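For example, the error counter could be expressed roughly like this; the filter pattern, metric name, and namespace are illustrative and depend on the function's actual log format:

```hcl
# Count every log line containing ERROR as one unit of a custom metric.
resource "aws_cloudwatch_log_metric_filter" "processing_errors" {
  name           = "image-processing-errors"
  log_group_name = "/aws/lambda/${aws_lambda_function.image_processor.function_name}"
  pattern        = "ERROR"

  metric_transformation {
    name      = "ImageProcessingErrors"
    namespace = "ImagePipeline"
    value     = "1"
  }
}
```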
CloudWatch Dashboards (Operational Visibility)
The dashboard provides a single-pane-of-glass view of the system:
Included Widgets
- Lambda invocations, errors, and throttles
- Duration metrics (average, max, P99)
- Concurrent executions
- Custom image processing time
- Custom image size trends
- Recent error logs (CloudWatch Logs Insights)
This allows operators to understand health, performance, and failures at a glance.
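In Terraform, the dashboard body is just JSON. A trimmed sketch with a single widget follows; the resource names, layout values, and widget title are illustrative only:

```hcl
resource "aws_cloudwatch_dashboard" "pipeline" {
  dashboard_name = "image-pipeline-${var.environment}"

  dashboard_body = jsonencode({
    widgets = [
      {
        # One metric widget plotting invocations and errors side by side.
        type   = "metric"
        x      = 0
        y      = 0
        width  = 12
        height = 6
        properties = {
          title  = "Lambda Invocations and Errors"
          region = var.aws_region
          stat   = "Sum"
          period = 300
          metrics = [
            ["AWS/Lambda", "Invocations", "FunctionName", aws_lambda_function.image_processor.function_name],
            ["AWS/Lambda", "Errors", "FunctionName", aws_lambda_function.image_processor.function_name]
          ]
        }
      }
    ]
  })
}
```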
Alarms, SNS Topics, and Notification Flow
Alerts are categorized by severity:
🔴 Critical Alerts
- Repeated image processing failures
- Unauthorized access attempts
- Sent to the critical SNS topic (email/SMS)

🟠 Performance Alerts
- High latency (P99)
- Excessive concurrent executions

🟡 Log-Based Alerts
- Access denied events
- Anomalous log patterns
Each SNS topic is explicitly permitted to receive CloudWatch alarm events, and subscriptions must be confirmed via email.
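Tying a custom metric to a topic looks roughly like this; the threshold, names, and namespace are placeholders rather than the project's exact values:

```hcl
# Alert the critical topic when processing errors repeat within five minutes.
resource "aws_cloudwatch_metric_alarm" "processing_errors" {
  alarm_name          = "image-processing-errors-critical"
  namespace           = "ImagePipeline"
  metric_name         = "ImageProcessingErrors"
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 1
  threshold           = 3
  comparison_operator = "GreaterThanOrEqualToThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [aws_sns_topic.critical_alerts.arn]
}
```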
Practical Deployment Steps
1. Install prerequisites:
   - Terraform
   - AWS CLI (configured)
   - Docker
2. Build the Lambda layer (the Docker-based build described earlier)
3. Configure variables (see the sketch after this list):
   - Region
   - Environment
   - Project name
   - Alert email addresses
4. Initialize Terraform: terraform init
5. Plan and apply: terraform plan, then terraform apply
6. Confirm SNS subscriptions via email
7. Test by uploading images
8. Destroy resources when done: terraform destroy
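The variable wiring might look like the following; the variable names and defaults are assumptions for illustration:

```hcl
# variables.tf (excerpt) -- hypothetical names matching the sketches above.
variable "aws_region" {
  type    = string
  default = "us-east-1"
}

variable "environment" {
  type    = string
  default = "dev"
}

variable "project_name" {
  type = string
}

variable "alert_email" {
  type = string
}
```

Values can then be supplied through a terraform.tfvars file or -var flags at plan and apply time.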
Testing Common Alarm Scenarios
I validated the observability stack by intentionally triggering failures:
- Uploading valid images → success metrics only
- Uploading non-image files → error metrics and alarms
- Concurrent uploads → concurrency alarms
- Oversized files → size-based alerts
- Unauthorized access simulations → security alerts
This confirms the system fails loudly and visibly, which is exactly what production observability should do.
Troubleshooting Lessons Learned
- Backend errors usually mean S3 permissions or naming issues
- Docker-based layer builds are far more reliable than local builds
- CloudWatch metrics can take a few minutes to appear
- Metric filter patterns must exactly match the log structure
- Unconfirmed SNS subscriptions silently drop alerts
These are real-world lessons, not textbook theory.
Conclusion
This project reinforced a critical DevOps principle:
If you can’t observe it, you can’t operate it.
By implementing observability programmatically with Terraform, monitoring becomes a first-class citizen of infrastructure—not an afterthought.
Logs, metrics, dashboards, alarms, and notifications together create a proactive safety net for serverless applications.
Here is the session link: