AWS with Terraform (Day 24)

Highly Available and Scalable Architecture on AWS with Terraform

A Production-Minded Two-Tier Design

As a DevOps engineer, one of the most important skills is designing systems that don’t fail when things go wrong. High availability, fault tolerance, and scalability are not optional—they are baseline expectations in modern cloud environments.

On Day 24 of my hands-on DevOps journey, I built and automated a highly available and scalable web application architecture on AWS using Terraform.
The focus was not just “making it work,” but making it production-ready, secure, and resilient.

This blog breaks down the architecture, traffic flow, Terraform structure, scaling strategy, and operational lessons from this project.


Project Overview

The goal of this project was to host a containerized web application on AWS with:

  • High availability across multiple Availability Zones

  • Backend instances fully private (no public IPs)

  • Automated scaling based on demand

  • Secure and controlled internet access

  • Fully provisioned using Terraform (Infrastructure as Code)

At a high level, the architecture consists of:

  • Application Load Balancer (ALB) for incoming traffic

  • Auto Scaling Group (ASG) of EC2 instances running Docker containers

  • Private subnets for backend instances

  • NAT Gateways for controlled outbound internet access


Core Architecture Components

1. VPC and Networking

  • Custom VPC with:

    • DNS hostnames enabled

    • DNS support enabled

  • Public subnets (one per AZ):

    • Host NAT Gateways

    • Expose public endpoints (ALB)

  • Private subnets (one per AZ):

    • Host EC2 instances running application containers

This separation ensures backend services are isolated from direct internet access.
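As a reference, a minimal vpc.tf sketch for this layout might look like the following (the CIDR ranges, the two-AZ count, and the var.azs variable are illustrative assumptions, not the project's exact code):

resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"  # example CIDR
  enable_dns_support   = true
  enable_dns_hostnames = true
}

# One public and one private subnet per AZ (two AZs assumed here)
resource "aws_subnet" "public" {
  count                   = 2
  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
  availability_zone       = var.azs[count.index]
  map_public_ip_on_launch = true
}

resource "aws_subnet" "private" {
  count             = 2
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index + 10)
  availability_zone = var.azs[count.index]
}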


2. Internet Gateway and NAT Gateways

  • Internet Gateway (IGW) attached to the VPC

  • One NAT Gateway per Availability Zone

    • Each NAT Gateway has its own Elastic IP

    • Enables outbound access for private instances

Why one NAT per AZ?

  • Avoids single points of failure

  • Prevents cross-AZ dependency

  • Improves resilience and routing efficiency
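A hedged nat.tf sketch, reusing the illustrative subnet names from the VPC example above (one Elastic IP and one NAT Gateway per AZ, plus the IGW):

resource "aws_internet_gateway" "igw" {
  vpc_id = aws_vpc.main.id
}

resource "aws_eip" "nat" {
  count  = 2
  domain = "vpc"  # one Elastic IP per NAT Gateway
}

resource "aws_nat_gateway" "this" {
  count         = 2
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id  # NAT lives in the public subnet of its AZ

  depends_on = [aws_internet_gateway.igw]
}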


3. Application Load Balancer (ALB)

  • Public-facing Application Load Balancer

  • Listens on:

    • Port 80 (HTTP)

    • Port 443 (HTTPS, optional with ACM)

  • Routes traffic to backend EC2 instances in private subnets

  • Performs health checks to ensure only healthy instances receive traffic

The ALB is the only public entry point to the system.
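An illustrative alb.tf fragment; the aws_security_group.alb and aws_lb_target_group.app references are assumed names for resources sketched in later sections:

resource "aws_lb" "app" {
  name               = "app-alb"
  load_balancer_type = "application"
  internal           = false
  security_groups    = [aws_security_group.alb.id]
  subnets            = aws_subnet.public[*].id  # ALB sits in the public subnets
}

resource "aws_lb_listener" "http" {
  load_balancer_arn = aws_lb.app.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.app.arn
  }
}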


4. Auto Scaling Group (ASG)

  • EC2 instances distributed across multiple AZs

  • Backed by a Launch Template

  • Automatically replaces unhealthy instances

  • Scales horizontally based on CloudWatch metrics

This ensures the application can handle traffic spikes and recover from failures automatically.
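A possible asg.tf core, wiring the private subnets, target group, and launch template together (names are illustrative; the launch template itself is sketched further below):

resource "aws_autoscaling_group" "app" {
  min_size            = 2
  max_size            = 5
  desired_capacity    = 2
  vpc_zone_identifier = aws_subnet.private[*].id      # spread across private subnets/AZs
  target_group_arns   = [aws_lb_target_group.app.arn]
  health_check_type   = "ELB"                         # replace instances the ALB marks unhealthy

  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }
}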


How Traffic Flows Through the System

  1. User sends a request over the internet

  2. Request hits the Application Load Balancer

  3. ALB forwards traffic to healthy EC2 instances in private subnets

  4. Instances process the request and return the response via the ALB

  5. For outbound needs (Docker image pulls, updates), instances use NAT Gateways

At no point are backend instances directly exposed to the internet.
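The routing behind this flow can be expressed roughly as follows: public subnets default-route to the IGW, while each private subnet default-routes to the NAT Gateway in its own AZ (again using the illustrative names from the earlier sketches):

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.igw.id
  }
}

resource "aws_route_table" "private" {
  count  = 2
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.this[count.index].id  # keep outbound traffic in the same AZ
  }
}

resource "aws_route_table_association" "private" {
  count          = 2
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private[count.index].id
}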


Terraform Resource Layout

For clarity and maintainability, the Terraform code is split into logical files:

  • provider.tf – AWS provider configuration and global tags

  • vpc.tf – VPC, subnets, route tables, and IGW

  • nat.tf – NAT Gateways and Elastic IPs (per AZ)

  • alb.tf – ALB, listeners, and ALB security group

  • target_groups.tf – Target groups and health checks

  • asg.tf – Launch template, ASG, scaling policies, CloudWatch alarms

  • security_groups.tf – Security groups for ALB and EC2 instances

This structure mirrors how production Terraform repositories are typically organized.
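For example, provider.tf can apply global tags through the AWS provider's default_tags block (the region and tag values below are placeholders):

provider "aws" {
  region = "us-east-1"  # example region

  default_tags {
    tags = {
      Project     = "day24-ha-webapp"
      ManagedBy   = "terraform"
      Environment = "dev"
    }
  }
}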


Security Group Design (Critical in Production)

ALB Security Group

  • Inbound:

    • HTTP (80) from 0.0.0.0/0

    • HTTPS (443) from 0.0.0.0/0

  • Outbound:

    • Allow all

EC2 Security Group

  • Inbound:

    • Port 80 only from ALB security group

  • Optional SSH:

    • Allowed only from specific admin IPs or VPN ranges

    • Never 0.0.0.0/0 in production

This ensures backend instances are reachable only through the load balancer.
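A security_groups.tf sketch along these lines; the important detail is that the EC2 group's ingress rule references the ALB security group ID instead of a CIDR range:

resource "aws_security_group" "alb" {
  name   = "alb-sg"
  vpc_id = aws_vpc.main.id

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "ec2" {
  name   = "ec2-sg"
  vpc_id = aws_vpc.main.id

  ingress {
    from_port       = 80
    to_port         = 80
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]  # only traffic from the ALB
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}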


Launch Template and User Data Automation

The Launch Template defines:

  • AMI

  • Instance type

  • Attached security groups

  • User data script

User Data Responsibilities

  • Install Docker

  • Pull application image from a registry

  • Run the container

  • Map host port 80 → container port 8000

This guarantees that any new instance launched by the ASG is production-ready without manual intervention.
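A rough launch template with an inline user data script; the AMI variable, instance type, registry image, and Amazon Linux 2023 package commands below are placeholders for whatever the real project uses:

resource "aws_launch_template" "app" {
  name_prefix            = "app-"
  image_id               = var.ami_id  # assumed variable
  instance_type          = "t3.micro"
  vpc_security_group_ids = [aws_security_group.ec2.id]

  user_data = base64encode(<<-EOF
    #!/bin/bash
    set -e
    dnf install -y docker                 # Amazon Linux 2023 assumed
    systemctl enable --now docker
    docker pull myregistry/myapp:latest   # placeholder image name
    docker run -d --restart always -p 80:8000 myregistry/myapp:latest
  EOF
  )
}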


Target Groups, Listeners, and Health Checks

Target groups manage backend registration and health monitoring.

Typical configuration:

  • Protocol: HTTP

  • Health check path: / or /health

  • Interval: 30 seconds

  • Healthy threshold: 2

  • Unhealthy threshold: tuned to how quickly failing instances should be taken out of rotation

Listeners forward traffic from the ALB to the target group.
For HTTPS, an ACM certificate can be attached to the HTTPS listener.
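A target group along these lines, using the health check values listed above (the matcher and unhealthy threshold are example choices):

resource "aws_lb_target_group" "app" {
  name     = "app-tg"
  port     = 80
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id

  health_check {
    path                = "/"
    interval            = 30
    healthy_threshold   = 2
    unhealthy_threshold = 3
    matcher             = "200"
  }
}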


Auto Scaling and Scaling Policies

The ASG is configured with clear capacity boundaries:

  • Minimum capacity: 1–2 instances

  • Desired capacity: 2 instances

  • Maximum capacity: 5 instances

Scaling Strategy

  • Target tracking policy

    • Maintain average CPU utilization at ~70%

  • CloudWatch alarms

    • Scale out when CPU > 80%

    • Scale in when CPU < 20%

This provides automatic horizontal scaling based on load.
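The target tracking part can be written roughly like this; the CloudWatch-alarm-driven scale-out/scale-in rules would be separate aws_cloudwatch_metric_alarm and scaling policy resources, omitted here for brevity:

resource "aws_autoscaling_policy" "cpu_target" {
  name                   = "cpu-target-tracking"
  autoscaling_group_name = aws_autoscaling_group.app.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 70  # keep average CPU around 70%
  }
}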


Testing and Troubleshooting Lessons

During testing, several real-world considerations came up:

  • Health checks may take time to stabilize—be patient

  • Private instances will not have public IPs (this is by design)

  • To access them, use:

    • Session Manager

    • Bastion host

    • EC2 Instance Connect endpoint

  • If targets are unhealthy:

    • Check user data logs

    • Verify Docker and container ports

  • Scaling issues usually point to:

    • Incorrect metric configuration

    • Wrong alarm period or threshold

These are the kinds of issues DevOps engineers deal with daily in production.


Cost Awareness and Cleanup

Key cost drivers:

  • Application Load Balancer

  • NAT Gateways (hourly + data processing)

Always clean up after testing:

terraform destroy -auto-approve

Cost awareness is part of operational responsibility.


Best Practices and Next Steps

  • Keep backend instances private

  • Use one NAT Gateway per AZ

  • Use meaningful health endpoints

  • Avoid public SSH access

  • Tag all resources consistently


Summary

This project demonstrates a clean, scalable, and highly available AWS architecture, fully automated with Terraform.

By combining ALB, private EC2 instances, Auto Scaling Groups, and per-AZ NAT Gateways, the design achieves resilience, security, and operational simplicity.

Day 24 completed.

One step closer to building and operating real-world production systems.

Here is the session link: 

