AWS with Terraform (Day 24)
Highly Available and Scalable Architecture on AWS with Terraform
A Production-Minded Two-Tier Design
As a DevOps engineer, one of the most important skills is designing systems that don’t fail when things go wrong. High availability, fault tolerance, and scalability are not optional—they are baseline expectations in modern cloud environments.
On Day 24 of my hands-on DevOps journey, I built and automated a highly available and scalable web application architecture on AWS using Terraform.
The focus was not just “making it work,” but making it production-ready, secure, and resilient.
This blog breaks down the architecture, traffic flow, Terraform structure, scaling strategy, and operational lessons from this project.
Project Overview
The goal of this project was to host a containerized web application on AWS with:
- High availability across multiple Availability Zones
- Backend instances fully private (no public IPs)
- Automated scaling based on demand
- Secure and controlled internet access
- Fully provisioned with Terraform (Infrastructure as Code)
At a high level, the architecture consists of:
- Application Load Balancer (ALB) for incoming traffic
- Auto Scaling Group (ASG) of EC2 instances running Docker containers
- Private subnets for backend instances
- NAT Gateways for controlled outbound internet access
Core Architecture Components
1. VPC and Networking
- Custom VPC with:
  - DNS hostnames enabled
  - DNS support enabled
- Public subnets (one per AZ):
  - Host NAT Gateways
  - Expose public endpoints (ALB)
- Private subnets (one per AZ):
  - Host EC2 instances running application containers
This separation ensures backend services are isolated from direct internet access.
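A minimal sketch of this networking layer, assuming two AZs in us-east-1 (the names main, public, and private are illustrative and are reused in the later snippets):

```hcl
locals {
  azs = ["us-east-1a", "us-east-1b"] # assumption: two AZs
}

resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
}

# One public subnet per AZ (hosts the ALB and the NAT Gateways)
resource "aws_subnet" "public" {
  count                   = length(local.azs)
  vpc_id                  = aws_vpc.main.id
  availability_zone       = local.azs[count.index]
  cidr_block              = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
  map_public_ip_on_launch = true
}

# One private subnet per AZ (hosts the application instances)
resource "aws_subnet" "private" {
  count             = length(local.azs)
  vpc_id            = aws_vpc.main.id
  availability_zone = local.azs[count.index]
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index + 10)
}
```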
2. Internet Gateway and NAT Gateways
- Internet Gateway (IGW) attached to the VPC
- One NAT Gateway per Availability Zone
- Each NAT Gateway has its own Elastic IP
- Enables outbound internet access for private instances

Why one NAT per AZ?

- Avoids single points of failure
- Prevents cross-AZ dependency
- Improves resilience and routing efficiency
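Building on the networking sketch above, the per-AZ NAT Gateways might look like this; the count index pairs each private subnet with the NAT Gateway in its own AZ, so the loss of one AZ never black-holes traffic from the others:

```hcl
resource "aws_internet_gateway" "igw" {
  vpc_id = aws_vpc.main.id
}

# One Elastic IP and one NAT Gateway per AZ, placed in the public subnets
resource "aws_eip" "nat" {
  count  = length(local.azs)
  domain = "vpc"
}

resource "aws_nat_gateway" "nat" {
  count         = length(local.azs)
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id
  depends_on    = [aws_internet_gateway.igw]
}

# Each private subnet routes outbound traffic through the NAT in the same AZ
resource "aws_route_table" "private" {
  count  = length(local.azs)
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.nat[count.index].id
  }
}

resource "aws_route_table_association" "private" {
  count          = length(local.azs)
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private[count.index].id
}
```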
3. Application Load Balancer (ALB)
- Public-facing Application Load Balancer
- Listens on:
  - Port 80 (HTTP)
  - Port 443 (HTTPS, optional with ACM)
- Routes traffic to backend EC2 instances in private subnets
- Performs health checks to ensure only healthy instances receive traffic
The ALB is the only public entry point to the system.
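A sketch of the load balancer and its HTTP listener; the security group and target group it references are defined in their own sections below:

```hcl
resource "aws_lb" "app" {
  name               = "app-alb"
  load_balancer_type = "application"
  internal           = false # public-facing: the single entry point
  security_groups    = [aws_security_group.alb.id]
  subnets            = aws_subnet.public[*].id
}

# HTTP listener; an HTTPS listener with an ACM certificate_arn can sit alongside it
resource "aws_lb_listener" "http" {
  load_balancer_arn = aws_lb.app.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.app.arn
  }
}
```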
4. Auto Scaling Group (ASG)
- EC2 instances distributed across multiple AZs
- Backed by a Launch Template
- Automatically replaces unhealthy instances
- Scales horizontally based on CloudWatch metrics
This ensures the application can handle traffic spikes and recover from failures automatically.
How Traffic Flows Through the System
1. User sends a request over the internet
2. The request hits the Application Load Balancer
3. The ALB forwards traffic to healthy EC2 instances in private subnets
4. Instances process the request and return the response via the ALB
5. For outbound needs (Docker image pulls, updates), instances use the NAT Gateways
At no point are backend instances directly exposed to the internet.
Terraform Resource Layout
For clarity and maintainability, the Terraform code is split into logical files:
- provider.tf – AWS provider configuration and global tags
- vpc.tf – VPC, subnets, route tables, and IGW
- nat.tf – NAT Gateways and Elastic IPs (per AZ)
- alb.tf – ALB, listeners, and ALB security group
- target_groups.tf – Target groups and health checks
- asg.tf – Launch template, ASG, scaling policies, CloudWatch alarms
- security_groups.tf – Security groups for ALB and EC2 instances
This structure mirrors how production Terraform repositories are typically organized.
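As one example, provider.tf can pin the provider version and apply global tags to every taggable resource via default_tags (the region and tag values here are placeholders):

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1" # placeholder: adjust to your region

  # Applied automatically to every taggable resource in the configuration
  default_tags {
    tags = {
      Project   = "ha-web-app"
      ManagedBy = "terraform"
    }
  }
}
```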
Security Group Design (Critical in Production)
ALB Security Group
- Inbound:
  - HTTP (80) from 0.0.0.0/0
  - HTTPS (443) from 0.0.0.0/0
- Outbound:
  - Allow all
EC2 Security Group
- Inbound:
  - Port 80 only from the ALB security group
- Optional SSH:
  - Allowed only from specific admin IPs or VPN ranges
  - Never 0.0.0.0/0 in production
This ensures backend instances are reachable only through the load balancer.
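In Terraform, this chaining is expressed by referencing the ALB security group's ID instead of a CIDR block in the instance rule, roughly like this:

```hcl
resource "aws_security_group" "alb" {
  name   = "alb-sg"
  vpc_id = aws_vpc.main.id

  ingress {
    description = "HTTP from anywhere"
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  # Add a matching 443 ingress rule when the HTTPS listener is enabled

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1" # all outbound traffic
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "ec2" {
  name   = "ec2-sg"
  vpc_id = aws_vpc.main.id

  # Port 80 only from the ALB security group, never from 0.0.0.0/0
  ingress {
    from_port       = 80
    to_port         = 80
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```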
Launch Template and User Data Automation
The Launch Template defines:
- AMI
- Instance type
- Attached security groups
- User data script
User Data Responsibilities
- Install Docker
- Pull the application image from a registry
- Run the container
- Map host port 80 → container port 8000
This guarantees that any new instance launched by the ASG is production-ready without manual intervention.
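A sketch of the launch template; the AMI ID and image name are placeholders, and the script assumes an Amazon Linux 2023 base image (hence dnf):

```hcl
resource "aws_launch_template" "app" {
  name_prefix            = "app-"
  image_id               = "ami-0123456789abcdef0" # placeholder: your AMI ID
  instance_type          = "t3.micro"
  vpc_security_group_ids = [aws_security_group.ec2.id]

  # Runs on first boot of every instance the ASG launches
  user_data = base64encode(<<-EOF
    #!/bin/bash
    dnf install -y docker
    systemctl enable --now docker
    # Placeholder image; pulled via the NAT Gateway, then mapped 80 -> 8000
    docker run -d --restart unless-stopped -p 80:8000 myregistry/myapp:latest
  EOF
  )
}
```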
Target Groups, Listeners, and Health Checks
Target groups manage backend registration and health monitoring.
Typical configuration:
- Protocol: HTTP
- Health check path: / or /health
- Interval: 30 seconds
- Healthy threshold: 2
- Unhealthy threshold: tuned based on tolerance
Listeners forward traffic from the ALB to the target group.
For HTTPS, an ACM certificate can be attached to the HTTPS listener.
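A sketch of the target group, assuming the application answers health checks on /health with a 200:

```hcl
resource "aws_lb_target_group" "app" {
  name     = "app-tg"
  port     = 80
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id

  health_check {
    path                = "/health" # assumption: the app exposes this endpoint
    interval            = 30
    healthy_threshold   = 2
    unhealthy_threshold = 3
    matcher             = "200"
  }
}
```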
Auto Scaling and Scaling Policies
The ASG is configured with clear capacity boundaries:
- Minimum capacity: 1–2 instances
- Desired capacity: 2 instances
- Maximum capacity: 5 instances
Scaling Strategy
- Target tracking policy:
  - Maintain average CPU utilization at ~70%
- CloudWatch alarms (an alternative approach using explicit thresholds):
  - Scale out when CPU > 80%
  - Scale in when CPU < 20%
This provides automatic horizontal scaling based on load.
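A sketch of the ASG with the target tracking variant; with health_check_type set to ELB, instances the ALB marks unhealthy are terminated and replaced automatically:

```hcl
resource "aws_autoscaling_group" "app" {
  name                = "app-asg"
  min_size            = 2
  desired_capacity    = 2
  max_size            = 5
  vpc_zone_identifier = aws_subnet.private[*].id
  target_group_arns   = [aws_lb_target_group.app.arn]
  health_check_type   = "ELB" # replace instances the ALB reports as unhealthy

  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }
}

# Target tracking: add or remove instances to hold average CPU near 70%
resource "aws_autoscaling_policy" "cpu" {
  name                   = "cpu-target-tracking"
  autoscaling_group_name = aws_autoscaling_group.app.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 70
  }
}
```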
Testing and Troubleshooting Lessons
During testing, several real-world considerations came up:
- Health checks may take time to stabilize, so be patient
- Private instances will not have public IPs (this is by design)
- To reach private instances for debugging, use:
  - Session Manager
  - A bastion host
  - An EC2 Instance Connect endpoint
- If targets are unhealthy:
  - Check the user data logs
  - Verify Docker and the container port mapping
- Scaling issues usually point to:
  - Incorrect metric configuration
  - Wrong alarm period or threshold
These are the kinds of issues DevOps engineers deal with daily in production.
Cost Awareness and Cleanup
Key cost drivers:
- Application Load Balancer
- NAT Gateways (hourly charges plus data processing)
Always clean up after testing by running terraform destroy.
Cost awareness is part of operational responsibility.
Best Practices and Next Steps
- Keep backend instances private
- Use one NAT Gateway per AZ
- Use meaningful health endpoints
- Avoid public SSH access
- Tag all resources consistently
Summary
This project demonstrates a clean, scalable, and highly available AWS architecture, fully automated with Terraform.
By combining ALB, private EC2 instances, Auto Scaling Groups, and per-AZ NAT Gateways, the design achieves resilience, security, and operational simplicity.
Day 24 completed.
One step closer to building and operating real-world production systems.