AWS with Terraform (Day 24)

Highly Available and Scalable Architecture on AWS with Terraform

A Production-Minded Two-Tier Design

As a DevOps engineer, one of the most important skills is designing systems that don’t fail when things go wrong. High availability, fault tolerance, and scalability are not optional—they are baseline expectations in modern cloud environments.

On Day 24 of my hands-on DevOps journey, I built and automated a highly available and scalable web application architecture on AWS using Terraform.
The focus was not just “making it work,” but making it production-ready, secure, and resilient.

This blog breaks down the architecture, traffic flow, Terraform structure, scaling strategy, and operational lessons from this project.


Project Overview

The goal of this project was to host a containerized web application on AWS with:

  • High availability across multiple Availability Zones

  • Backend instances fully private (no public IPs)

  • Automated scaling based on demand

  • Secure and controlled internet access

  • Fully provisioned using Terraform (Infrastructure as Code)

At a high level, the architecture consists of:

  • Application Load Balancer (ALB) for incoming traffic

  • Auto Scaling Group (ASG) of EC2 instances running Docker containers

  • Private subnets for backend instances

  • NAT Gateways for controlled outbound internet access


Core Architecture Components

1. VPC and Networking

  • Custom VPC with:

    • DNS hostnames enabled

    • DNS support enabled

  • Public subnets (one per AZ):

    • Host NAT Gateways

    • Expose public endpoints (ALB)

  • Private subnets (one per AZ):

    • Host EC2 instances running application containers

This separation ensures backend services are isolated from direct internet access.
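As a reference, a minimal vpc.tf sketch for this layout might look like the following (the CIDR ranges, the two-AZ count, and the var.azs variable are illustrative assumptions, not the project's exact code):

resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"  # example CIDR
  enable_dns_support   = true
  enable_dns_hostnames = true
}

# One public and one private subnet per AZ (two AZs assumed here)
resource "aws_subnet" "public" {
  count                   = 2
  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
  availability_zone       = var.azs[count.index]
  map_public_ip_on_launch = true
}

resource "aws_subnet" "private" {
  count             = 2
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index + 10)
  availability_zone = var.azs[count.index]
}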


2. Internet Gateway and NAT Gateways

  • Internet Gateway (IGW) attached to the VPC

  • One NAT Gateway per Availability Zone

    • Each NAT Gateway has its own Elastic IP

    • Enables outbound access for private instances

Why one NAT per AZ?

  • Avoids single points of failure

  • Prevents cross-AZ dependency

  • Improves resilience and routing efficiency
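A hedged nat.tf sketch, reusing the illustrative subnet names from the VPC example above (one Elastic IP and one NAT Gateway per AZ, plus the IGW):

resource "aws_internet_gateway" "igw" {
  vpc_id = aws_vpc.main.id
}

resource "aws_eip" "nat" {
  count  = 2
  domain = "vpc"  # one Elastic IP per NAT Gateway
}

resource "aws_nat_gateway" "this" {
  count         = 2
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id  # NAT lives in the public subnet of its AZ

  depends_on = [aws_internet_gateway.igw]
}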


3. Application Load Balancer (ALB)

  • Public-facing Application Load Balancer

  • Listens on:

    • Port 80 (HTTP)

    • Port 443 (HTTPS, optional with ACM)

  • Routes traffic to backend EC2 instances in private subnets

  • Performs health checks to ensure only healthy instances receive traffic

The ALB is the only public entry point to the system.
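An illustrative alb.tf fragment; the aws_security_group.alb and aws_lb_target_group.app references are assumed names for resources sketched in later sections:

resource "aws_lb" "app" {
  name               = "app-alb"
  load_balancer_type = "application"
  internal           = false
  security_groups    = [aws_security_group.alb.id]
  subnets            = aws_subnet.public[*].id  # ALB sits in the public subnets
}

resource "aws_lb_listener" "http" {
  load_balancer_arn = aws_lb.app.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.app.arn
  }
}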


4. Auto Scaling Group (ASG)

  • EC2 instances distributed across multiple AZs

  • Backed by a Launch Template

  • Automatically replaces unhealthy instances

  • Scales horizontally based on CloudWatch metrics

This ensures the application can handle traffic spikes and recover from failures automatically.
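A possible asg.tf core, wiring the private subnets, target group, and launch template together (names are illustrative; the launch template itself is sketched further below):

resource "aws_autoscaling_group" "app" {
  min_size            = 2
  max_size            = 5
  desired_capacity    = 2
  vpc_zone_identifier = aws_subnet.private[*].id      # spread across private subnets/AZs
  target_group_arns   = [aws_lb_target_group.app.arn]
  health_check_type   = "ELB"                         # replace instances the ALB marks unhealthy

  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }
}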


How Traffic Flows Through the System

  1. User sends a request over the internet

  2. Request hits the Application Load Balancer

  3. ALB forwards traffic to healthy EC2 instances in private subnets

  4. Instances process the request and return the response via the ALB

  5. For outbound needs (Docker image pulls, updates), instances use NAT Gateways

At no point are backend instances directly exposed to the internet.
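The routing behind this flow can be expressed roughly as follows: public subnets default-route to the IGW, while each private subnet default-routes to the NAT Gateway in its own AZ (again using the illustrative names from the earlier sketches):

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.igw.id
  }
}

resource "aws_route_table" "private" {
  count  = 2
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.this[count.index].id  # keep outbound traffic in the same AZ
  }
}

resource "aws_route_table_association" "private" {
  count          = 2
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private[count.index].id
}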


Terraform Resource Layout

For clarity and maintainability, the Terraform code is split into logical files:

  • provider.tf – AWS provider configuration and global tags

  • vpc.tf – VPC, subnets, route tables, and IGW

  • nat.tf – NAT Gateways and Elastic IPs (per AZ)

  • alb.tf – ALB, listeners, and ALB security group

  • target_groups.tf – Target groups and health checks

  • asg.tf – Launch template, ASG, scaling policies, CloudWatch alarms

  • security_groups.tf – Security groups for ALB and EC2 instances

This structure mirrors how production Terraform repositories are typically organized.
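For example, provider.tf can apply global tags through the AWS provider's default_tags block (the region and tag values below are placeholders):

provider "aws" {
  region = "us-east-1"  # example region

  default_tags {
    tags = {
      Project     = "day24-ha-webapp"
      ManagedBy   = "terraform"
      Environment = "dev"
    }
  }
}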


Security Group Design (Critical in Production)

ALB Security Group

  • Inbound:

    • HTTP (80) from 0.0.0.0/0

    • HTTPS (443) from 0.0.0.0/0

  • Outbound:

    • Allow all

EC2 Security Group

  • Inbound:

    • Port 80 only from ALB security group

  • Optional SSH:

    • Allowed only from specific admin IPs or VPN ranges

    • Never 0.0.0.0/0 in production

This ensures backend instances are reachable only through the load balancer.
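A security_groups.tf sketch along these lines; the important detail is that the EC2 group's ingress rule references the ALB security group ID instead of a CIDR range:

resource "aws_security_group" "alb" {
  name   = "alb-sg"
  vpc_id = aws_vpc.main.id

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "ec2" {
  name   = "ec2-sg"
  vpc_id = aws_vpc.main.id

  ingress {
    from_port       = 80
    to_port         = 80
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]  # only traffic from the ALB
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}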


Launch Template and User Data Automation

The Launch Template defines:

  • AMI

  • Instance type

  • Attached security groups

  • User data script

User Data Responsibilities

  • Install Docker

  • Pull application image from a registry

  • Run the container

  • Map host port 80 → container port 8000

This guarantees that any new instance launched by the ASG is production-ready without manual intervention.
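A rough launch template with an inline user data script; the AMI variable, instance type, registry image, and Amazon Linux 2023 package commands below are placeholders for whatever the real project uses:

resource "aws_launch_template" "app" {
  name_prefix            = "app-"
  image_id               = var.ami_id  # assumed variable
  instance_type          = "t3.micro"
  vpc_security_group_ids = [aws_security_group.ec2.id]

  user_data = base64encode(<<-EOF
    #!/bin/bash
    set -e
    dnf install -y docker                 # Amazon Linux 2023 assumed
    systemctl enable --now docker
    docker pull myregistry/myapp:latest   # placeholder image name
    docker run -d --restart always -p 80:8000 myregistry/myapp:latest
  EOF
  )
}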


Target Groups, Listeners, and Health Checks

Target groups manage backend registration and health monitoring.

Typical configuration:

  • Protocol: HTTP

  • Health check path: / or /health

  • Interval: 30 seconds

  • Healthy threshold: 2

  • Unhealthy threshold: tuned to how quickly failing instances should be taken out of rotation

Listeners forward traffic from the ALB to the target group.
For HTTPS, an ACM certificate can be attached to the HTTPS listener.
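A target group along these lines, using the health check values listed above (the matcher and unhealthy threshold are example choices):

resource "aws_lb_target_group" "app" {
  name     = "app-tg"
  port     = 80
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id

  health_check {
    path                = "/"
    interval            = 30
    healthy_threshold   = 2
    unhealthy_threshold = 3
    matcher             = "200"
  }
}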


Auto Scaling and Scaling Policies

The ASG is configured with clear capacity boundaries:

  • Minimum capacity: 1–2 instances

  • Desired capacity: 2 instances

  • Maximum capacity: 5 instances

Scaling Strategy

  • Target tracking policy

    • Maintain average CPU utilization at ~70%

  • CloudWatch alarms

    • Scale out when CPU > 80%

    • Scale in when CPU < 20%

This provides automatic horizontal scaling based on load.
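The target tracking part can be written roughly like this; the CloudWatch-alarm-driven scale-out/scale-in rules would be separate aws_cloudwatch_metric_alarm and scaling policy resources, omitted here for brevity:

resource "aws_autoscaling_policy" "cpu_target" {
  name                   = "cpu-target-tracking"
  autoscaling_group_name = aws_autoscaling_group.app.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 70  # keep average CPU around 70%
  }
}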


Testing and Troubleshooting Lessons

During testing, several real-world considerations came up:

  • Health checks may take time to stabilize—be patient

  • Private instances will not have public IPs (this is by design)

  • To access them, use:

    • Session Manager

    • Bastion host

    • EC2 Instance Connect endpoint

  • If targets are unhealthy:

    • Check user data logs

    • Verify Docker and container ports

  • Scaling issues usually point to:

    • Incorrect metric configuration

    • Wrong alarm period or threshold

These are the kinds of issues DevOps engineers deal with daily in production.


Cost Awareness and Cleanup

Key cost drivers:

  • Application Load Balancer

  • NAT Gateways (hourly + data processing)

Always clean up after testing:

terraform destroy -auto-approve

Cost awareness is part of operational responsibility.


Best Practices and Next Steps

  • Keep backend instances private

  • Use one NAT Gateway per AZ

  • Use meaningful health endpoints

  • Avoid public SSH access

  • Tag all resources consistently


Summary

This project demonstrates a clean, scalable, and highly available AWS architecture, fully automated with Terraform.

By combining ALB, private EC2 instances, Auto Scaling Groups, and per-AZ NAT Gateways, the design achieves resilience, security, and operational simplicity.

Day 24 completed.

One step closer to building and operating real-world production systems.

Here is the session link: 

