Tag Archives: Terraform

Securely Scale AWS with Terraform Sentinel Policy

01/30/2026 HuuPV

In high-velocity engineering organizations, the “move fast and break things” mantra often collides violently with security compliance and cost governance. As you scale AWS infrastructure using Infrastructure as Code (IaC), manual code reviews become the primary bottleneck. For expert practitioners using Terraform Cloud or Enterprise, the solution isn’t slowing down; it’s automating governance. This is the domain of Terraform Sentinel Policy.

Sentinel is HashiCorp’s embedded policy-as-code framework. Unlike external linting tools that check syntax, Sentinel sits directly in the provisioning path, intercepting the Terraform plan before execution. It allows SREs and Platform Engineers to define granular, logic-based guardrails that enforce CIS benchmarks, limit blast radius, and control costs without hindering developer velocity. In this guide, we will bypass the basics and dissect how to architect, write, and test advanced Sentinel policies for enterprise-grade AWS environments.

The Architecture of Policy Enforcement

To leverage Terraform Sentinel Policy effectively, one must understand where it lives in the lifecycle. Sentinel runs in a sandboxed environment within the Terraform Cloud/Enterprise execution layer. A policy does not query your cloud provider APIs; instead, it relies on imports (such as tfplan/v2, tfconfig/v2, tfstate/v2, and tfrun) that expose the run’s data, and it makes decisions from that context.

When a run is triggered:

Plan Phase: Terraform generates the execution plan.
Policy Check: Sentinel evaluates the plan against your defined policy sets.
Decision: The run is allowed, halted (Hard Mandatory), or flagged for override (Soft Mandatory).
Apply Phase: Provisioning occurs only if the policy check passes.

Pro-Tip: The tfplan/v2 import is the standard for accessing resource data. Avoid the legacy tfplan import as it lacks the detailed resource changes structure required for complex AWS resource evaluations.

Anatomy of an AWS Sentinel Policy

A robust policy typically consists of three phases: Imports, Filtering, and Evaluation. Let’s examine a scenario where we must ensure all AWS S3 buckets have server-side encryption enabled.

1. The Setup

First, we define our imports and useful helper functions to filter the plan for specific resource types.

import "tfplan/v2" as tfplan

# Filter resources by type
get_resources = func(type) {
  resources = {}
  for tfplan.resource_changes as address, rc {
    if rc.type is type and
       (rc.change.actions contains "create" or rc.change.actions contains "update") {
      resources[address] = rc
    }
  }
  return resources
}

# Fetch all S3 Buckets
s3_buckets = get_resources("aws_s3_bucket")

2. The Logic Rule

Next, we iterate through the filtered resources to validate their configuration. Note the use of the all quantifier, which ensures the rule returns true only if every instance passes the check.

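# Rule: every planned bucket must declare inline server-side encryption.
# Note: this inspects the inline block on aws_s3_bucket. With AWS provider v4
# and later, encryption is usually declared in a separate
# aws_s3_bucket_server_side_encryption_configuration resource, so filter on
# that type as well if your modules use it.
encryption_enforced = rule {
  all s3_buckets as _, bucket {
    keys(bucket.change.after) contains "server_side_encryption_configuration" and
    length(bucket.change.after.server_side_encryption_configuration) > 0
  }
}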

# Main Rule
main = rule {
  encryption_enforced
}

This policy inspects the after state—the predicted state of the resource after the apply—ensuring that we are validating the final outcome, not just the code written in main.tf.

Advanced AWS Scaling Patterns

Scaling securely on AWS requires more than just resource configuration checks. It requires context-aware policies. Here are two advanced patterns for expert SREs.

Pattern 1: Cost Control via Instance Type Allow-Listing

To prevent accidental provisioning of expensive x1e.32xlarge instances, use a policy that compares requested types against an allowed list.

# Allowed EC2 types
allowed_types = ["t3.micro", "t3.small", "m5.large"]

# Rule: every planned EC2 instance must use an allowed type
instance_type_allowed = rule {
  all get_resources("aws_instance") as _, instance {
    instance.change.after.instance_type in allowed_types
  }
}

Pattern 2: Enforcing Mandatory Tags for Cost Allocation

At scale, untagged resources are “ghost resources.” You can enforce that resources carry specific tags (e.g., CostCenter, Environment). The rule below checks EC2 instances; apply the same check to every taggable resource type you care about.

mandatory_tags = ["CostCenter", "Environment"]

validate_tags = rule {
  all get_resources("aws_instance") as _, instance {
    all mandatory_tags as t {
      keys(instance.change.after.tags) contains t
    }
  }
}

Testing and Mocking Policies

Writing policy is development. Therefore, it requires testing. You should never push a Terraform Sentinel Policy to production without verifying it against mock data.

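Mock data comes from real runs: Terraform Cloud and Enterprise let you download Sentinel mocks for a run’s plan. The JSON produced by terraform show -json is useful for inspecting a plan, but it is not the mock format the Sentinel CLI expects. Unpack the downloaded mocks next to your policy, reference them from sentinel.hcl, and evaluate the policy locally with the Sentinel CLI:

$ sentinel apply -trace policy.sentinel
$ sentinel test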

By creating a suite of test cases (passing and failing mocks), you can integrate policy testing into your CI/CD pipeline, ensuring that a change to the governance logic doesn’t accidentally block legitimate deployments.
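As a rough illustration of one such test case, the Sentinel CLI reads test definitions written in HCL. The sketch below assumes the policy file is named policy.sentinel and that a mock file called mock-tfplan-pass-v2.sentinel was exported from a Terraform Cloud run and saved under testdata/; both file names are placeholders, not part of the original setup.

```hcl
# test/policy/pass.hcl
# (the CLI looks for test cases under test/<policy-name>/)

# Point the tfplan/v2 import at mock data instead of a live plan.
# The mock file name is a placeholder for a file exported from a real run.
mock "tfplan/v2" {
  module {
    source = "../../testdata/mock-tfplan-pass-v2.sentinel"
  }
}

# Expected result of each rule for this mock.
test {
  rules = {
    main = true
  }
}
```

Running sentinel test from the directory that contains policy.sentinel evaluates every case under test/policy/. A matching fail.hcl that points at a non-compliant mock and expects main = false covers the denial path.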

Enforcement Levels: The Deployment Strategy

When rolling out new policies, avoid the “Big Bang” approach. Sentinel offers three enforcement levels:

Advisory: Logs a warning but allows the run to proceed. Ideal for testing new policies in production without impact.
Soft Mandatory: Halts the run but allows administrators to override. Useful for edge cases where human judgment is required.
Hard Mandatory: Halts the run explicitly. No overrides. Use this for strict security violations (e.g., public S3 buckets, open security group 0.0.0.0/0).
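One way to stage such a rollout is to set the level per policy in the policy set’s sentinel.hcl file. The sketch below is illustrative only; the three policy names and file paths are hypothetical.

```hcl
# sentinel.hcl at the root of the policy set repository.
# Policy names and source paths are hypothetical examples.

# New policy: observe violations without blocking anyone.
policy "require-s3-encryption" {
  source            = "./require-s3-encryption.sentinel"
  enforcement_level = "advisory"
}

# Established policy: block by default, allow an authorized override.
policy "restrict-instance-types" {
  source            = "./restrict-instance-types.sentinel"
  enforcement_level = "soft-mandatory"
}

# Strict security control: no override possible.
policy "deny-open-security-groups" {
  source            = "./deny-open-security-groups.sentinel"
  enforcement_level = "hard-mandatory"
}
```

Promoting a policy from advisory to a mandatory level then becomes a one-line, reviewable change in version control.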

Frequently Asked Questions (FAQ)

How does Sentinel differ from OPA (Open Policy Agent)?

While OPA is a general-purpose policy engine using Rego, Sentinel is embedded deeply into the HashiCorp ecosystem. Sentinel’s integration with Terraform Cloud allows it to access data from the Plan, Configuration, and State without complex external setups. However, OPA is often used for Kubernetes (Gatekeeper), whereas Sentinel excels in the Terraform layer.

Can I access cost estimates in my policy?

Yes. When cost estimation is enabled, Terraform Cloud generates a cost estimate for the plan. By importing tfrun, you can read it (for example, the cost_estimate.delta_monthly_cost value) and write policies that deny infrastructure changes if the delta in monthly cost exceeds a certain threshold (e.g., increasing the bill by more than $500/month).

Does Sentinel affect the performance of Terraform runs?

Sentinel executes after the plan is calculated. While the execution time of the policy itself is usually negligible (milliseconds to seconds), extensive API calls within the policy (if using external HTTP imports) can add latency. Stick to using the standard tfplan imports for optimal performance.

Conclusion

Implementing Terraform Sentinel Policy is a definitive step towards maturity in your cloud operating model. It shifts security left, turning vague compliance documents into executable code that scales with your AWS infrastructure. By treating policy as code—authoring, testing, and versioning it—you empower your developers to deploy faster with the confidence that the guardrails will catch any critical errors.

Start small: Audit your current AWS environment, identify the top 3 risks (e.g., unencrypted volumes, open security groups), and implement them as Advisory policies today. Thank you for reading the DevopsRoles page!

Terraform

Mastering Factorio with Terraform: The Ultimate Automation Guide

01/18/2026 HuuPV

For the uninitiated, Factorio is a game about automation. For the Senior DevOps Engineer, it is a spiritual mirror of our daily lives. You start by manually crafting plates (manual provisioning), move to burner drills (shell scripts), and eventually build a mega-base capable of launching rockets per minute (fully automated Kubernetes clusters).

But why stop at automating the gameplay? As infrastructure experts, we know that the factory must grow, and the server hosting it should be as resilient and reproducible as the factory itself. In this guide, we will bridge the gap between gaming and professional Infrastructure as Code (IaC). We are going to deploy a high-performance, cost-optimized, and fully persistent Factorio dedicated server using Factorio with Terraform.

Why Terraform for a Game Server?

If you are reading this, you likely already know Terraform’s value proposition. However, applying it to stateful workloads like game servers presents unique challenges that test your architectural patterns.

Immutable Infrastructure: Treat the game server binary and OS as ephemeral. Only the /saves directory matters.
Cost Control: Factorio servers don’t need to run 24/7 if no one is playing. Terraform allows you to spin up the infrastructure for a weekend session and destroy it Sunday night, while preserving state.
Disaster Recovery: If your server crashes or the instance degrades, a simple terraform apply brings the factory back online in minutes.

Pro-Tip: Factorio is heavily single-threaded. When choosing your compute instance (e.g., AWS EC2), prioritize high clock speeds (GHz) over core count. An AWS c5.large or c6i.large is often superior to general-purpose instances for maintaining 60 UPS (Updates Per Second) on large mega-bases.

Architecture Overview

We will design a modular architecture on AWS, though the concepts apply to GCP, Azure, or DigitalOcean. Our stack includes:

Compute: EC2 Instance (optimized for compute).
Storage: Separate EBS volume for game saves (preventing data loss on instance termination) or an S3-sync strategy.
Network: VPC, Subnet, and Security Groups allowing UDP/34197.
Provisioning: Cloud-Init (`user_data`) to bootstrap Docker and the headless Factorio container.

Step 1: The Network & Security Layer

Factorio uses UDP port 34197 by default. Unlike HTTP services, we don’t need a complex Load Balancer; a direct public IP attachment is sufficient and reduces latency.

resource "aws_security_group" "factorio_sg" {
  name        = "factorio-allow-udp"
  description = "Allow Factorio UDP traffic"
  vpc_id      = module.vpc.vpc_id

  ingress {
    description = "Factorio Game Port"
    from_port   = 34197
    to_port     = 34197
    protocol    = "udp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    description = "SSH Access (Strict)"
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = [var.admin_ip] # Always restrict SSH!
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
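The security group above references var.admin_ip, which is not declared in the snippet. A minimal sketch of that variable follows, assuming you want Terraform to reject malformed or wide-open values at plan time; the validation rule is a suggestion, not part of the original configuration.

```hcl
variable "admin_ip" {
  description = "CIDR block allowed to reach SSH, for example 203.0.113.10/32"
  type        = string

  validation {
    # cidrhost() fails on anything that is not valid CIDR notation,
    # and can() turns that failure into false.
    condition     = can(cidrhost(var.admin_ip, 0)) && var.admin_ip != "0.0.0.0/0"
    error_message = "The admin_ip value must be a valid CIDR block and must not be 0.0.0.0/0."
  }
}
```

Passing a bare address such as 203.0.113.10 fails the check, which is intentional: cidr_blocks expects CIDR notation, so a single host needs the /32 suffix.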

Step 2: Persistent Storage Strategy

This is the most critical section. In a “Factorio with Terraform” setup, if you run terraform destroy, you must not lose the factory. We have two primary patterns:

EBS Volume Attachment: A dedicated EBS volume that exists outside the lifecycle of the EC2 instance.
S3 Sync (The Cloud-Native Way): The instance pulls the latest save from S3 on boot and pushes it back on shutdown (or via cron).

For experts, I recommend the S3 Sync pattern for true immutability. It avoids the headaches of EBS volume attachment states and availability zone constraints.
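For the S3 Sync pattern to survive terraform destroy, the bucket itself has to sit outside the configuration being destroyed. The following is a sketch of a small, long-lived “storage” configuration under that assumption; the variable name is hypothetical, and versioning is an optional safety net against a bad sync.

```hcl
# Long-lived "storage" configuration, applied once and kept out of the
# disposable server configuration that gets destroyed after each session.

variable "saves_bucket_name" {
  description = "Globally unique name for the save-game bucket (placeholder)"
  type        = string
}

resource "aws_s3_bucket" "factorio_saves" {
  bucket = var.saves_bucket_name

  # Guard against accidental deletion of the factory.
  lifecycle {
    prevent_destroy = true
  }
}

# Keep previous versions of each save so an overwritten file can be restored.
resource "aws_s3_bucket_versioning" "factorio_saves" {
  bucket = aws_s3_bucket.factorio_saves.id

  versioning_configuration {
    status = "Enabled"
  }
}
```

If the bucket is split out this way, the server configuration would look it up with a data "aws_s3_bucket" block instead of referencing the resource directly.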

resource "aws_iam_role_policy" "factorio_s3_access" {
  name = "factorio_s3_policy"
  role = aws_iam_role.factorio_role.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = [
          "s3:GetObject",
          "s3:PutObject",
          "s3:ListBucket"
        ]
        Effect   = "Allow"
        Resource = [
          aws_s3_bucket.factorio_saves.arn,
          "${aws_s3_bucket.factorio_saves.arn}/*"
        ]
      },
    ]
  })
}
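The policy above attaches to aws_iam_role.factorio_role, and the instance later uses aws_iam_instance_profile.factorio_profile; neither is shown in the original. A minimal sketch of both, with placeholder names:

```hcl
# Role that the EC2 instance assumes. The name is a placeholder.
resource "aws_iam_role" "factorio_role" {
  name = "factorio-server-role"

  # Trust policy: only the EC2 service may assume this role.
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action    = "sts:AssumeRole"
        Effect    = "Allow"
        Principal = { Service = "ec2.amazonaws.com" }
      }
    ]
  })
}

# Instance profile that hands the role to the instance.
resource "aws_iam_instance_profile" "factorio_profile" {
  name = "factorio-server-profile"
  role = aws_iam_role.factorio_role.name
}
```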

Step 3: The Compute Instance & Cloud-Init

We use the user_data field to bootstrap the environment. We will use the community-maintained factoriotools/factorio Docker image, which bundles the headless server; pulling a newer image tag upgrades the game binary.

resource "aws_instance" "server" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "c5.large" # High single-core performance
  
  subnet_id                   = module.vpc.public_subnets[0]
  vpc_security_group_ids      = [aws_security_group.factorio_sg.id]
  iam_instance_profile        = aws_iam_instance_profile.factorio_profile.name

  # templatefile() renders the script directly; the template_file data
  # source is deprecated.
  user_data = templatefile("${path.module}/scripts/setup.sh.tpl", {
    bucket_name = aws_s3_bucket.factorio_saves.id
    save_file   = "my-megabase.zip"
  })

  # Replace the instance (instead of updating it in place) when the script changes
  user_data_replace_on_change = true

  # Spot instances can save you 70% cost, but ensure you handle interruption!
  instance_market_options {
    market_type = "spot"
  }

  tags = {
    Name = "Factorio-Server"
  }
}
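The instance refers to data.aws_ami.ubuntu, which the original does not define. One plausible definition is sketched below, assuming an Ubuntu 22.04 image published by Canonical; pick the release that matches your bootstrap script.

```hcl
# Latest Ubuntu 22.04 (Jammy) x86_64 image published by Canonical.
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical's AWS account ID

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}
```

Because most_recent = true resolves to a new AMI whenever Canonical publishes one, a later plan may propose replacing the instance. That fits the immutable approach here, but it is worth knowing before running apply.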

The Cloud-Init Script (setup.sh.tpl)

The bash script below handles the “hydrate” phase (downloading save) and the “run” phase.

#!/bin/bash
# Install Docker and AWS CLI
# (On newer Ubuntu releases the awscli apt package may be unavailable;
# install the CLI via snap or the official installer instead.)
apt-get update && apt-get install -y docker.io awscli

# 1. Hydrate: Download latest save from S3
# Keep the original file name so the sync in step 4 updates the same object
mkdir -p /opt/factorio/saves
aws s3 cp s3://${bucket_name}/${save_file} /opt/factorio/saves/${save_file} || echo "No save found, starting fresh"

# 2. Permissions
chown -R 845:845 /opt/factorio

# 3. Run Factorio Container
docker run -d \
  -p 34197:34197/udp \
  -v /opt/factorio:/factorio \
  --name factorio \
  --restart always \
  factoriotools/factorio

# 4. Setup Auto-Save Sync (Crontab)
# No --delete: an empty local directory must never wipe the saves in S3
echo "*/5 * * * * aws s3 sync /opt/factorio/saves s3://${bucket_name}/" > /tmp/cronjob
crontab /tmp/cronjob

Advanced Concept: To prevent data loss on Spot Instance termination, listen for the EC2 Instance Termination Warning (via metadata service) and trigger a force-save and S3 upload immediately.

Managing State and Updates

One of the benefits of using Factorio with Terraform is update management. When Wube Software releases a new version of Factorio:

Update the Docker tag in your Terraform variable or Cloud-Init script.
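Run terraform apply (or force a rebuild with terraform apply -replace=aws_instance.server; the older terraform taint command is deprecated).
Terraform replaces the instance. Note that a user_data change only forces replacement when user_data_replace_on_change = true; otherwise the instance is updated in place and cloud-init does not run again.
Cloud-Init pulls the save from S3 and the new binary version.
The server is back online within a few minutes with the latest patch.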

Cost Optimization: The Weekend Warrior Pattern

Running a c5.large 24/7 can cost roughly $60-$70/month. If you only play on weekends, this is wasteful.

By wrapping your Terraform configuration in a CI/CD pipeline (like GitHub Actions), you can create a “ChatOps” workflow (e.g., via Discord slash commands). A command like /start-server triggers terraform apply, and /stop-server triggers terraform destroy. Because your state is safely in S3 (both Terraform state and game save state), you pay $0 for compute during the week. This only holds if the save bucket is managed outside the configuration being destroyed; otherwise terraform destroy will try to remove the bucket as well.

Frequently Asked Questions (FAQ)

Can I use Terraform to manage in-game mods?

Yes. The factoriotools/factorio image supports a mods/ directory. You can upload your mod-list.json and zip files to S3, and have the Cloud-Init script pull them alongside the save file. Alternatively, you can define the mod list as an environment variable passed into the container.

How do I handle the initial world generation?

If no save file exists in S3 (the first run), the Docker container will generate a new map based on its map-gen-settings.json and map-settings.json. Once generated, your cron job will upload this new save to S3, establishing the persistence loop.

Is Terraform overkill for a single server?

For a “click-ops” manual setup, maybe. But as an expert, you know that “manual” means “unmaintainable.” Terraform documents your configuration, allows for version control of your server settings, and makes it straightforward to move between regions. Moving between cloud providers still means rewriting the provider-specific resources, but the overall design carries over.

Conclusion

Deploying Factorio with Terraform is more than just a fun project; it is an exercise in designing stateful, resilient applications on ephemeral infrastructure. By decoupling storage (S3) from compute (EC2) and automating the configuration via Cloud-Init, you achieve a server setup that is robust, cheap to run, and easy to upgrade.

The factory must grow, and now, your infrastructure can grow with it.

AI Prompts, AIOps, Terraform

Deploy Generative AI with Terraform: Automated Agent Lifecycle

01/15/2026 HuuPV

The shift from Jupyter notebooks to production-grade infrastructure is often the “valley of death” for AI projects. While data scientists excel at model tuning, the operational reality of managing API quotas, secure context retrieval, and scalable inference endpoints requires rigorous engineering. This is where Generative AI with Terraform becomes the critical bridge between experimental code and reliable, scalable application delivery.

In this guide, we will bypass the basics of “what is IaC” and focus on architecting a robust automated lifecycle for Generative AI agents. We will cover provisioning vector databases for RAG (Retrieval-Augmented Generation), securing LLM credentials via Secrets Manager, and deploying containerized agents using Amazon ECS—all defined strictly in HCL.

The Architecture of AI-Native Infrastructure

When we talk about deploying Generative AI with Terraform, we are typically orchestrating three distinct layers. Unlike traditional web apps, AI applications require specialized state management for embeddings and massive compute bursts for inference.

Knowledge Layer (RAG): Vector databases (e.g., Pinecone, Milvus, or Amazon OpenSearch Service) to store embeddings.
Inference Layer (Compute): Containers hosting the orchestration logic (LangChain/LlamaIndex) running on ECS, EKS, or Lambda.
Model Gateway (API): Secure interfaces to foundation models (Amazon Bedrock, OpenAI, Anthropic).

Pro-Tip for SREs: Avoid managing model weights directly in Terraform state. Terraform is designed for infrastructure state, not gigabyte-sized binary blobs. Use Terraform to provision the S3 buckets and permissions, but delegate the artifact upload to your CI/CD pipeline or DVC (Data Version Control).

1. Provisioning the Knowledge Base (Vector Store)

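For a RAG architecture, the vector store is your database. Below is a starting pattern for deploying an Amazon OpenSearch Serverless collection, which serves as a highly scalable vector store compatible with LangChain. The snippet covers the collection and its encryption policy; a usable collection also needs a network policy and a data access policy.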

resource "aws_opensearchserverless_collection" "agent_memory" {
  name        = "gen-ai-agent-memory"
  type        = "VECTORSEARCH"
  description = "Vector store for Generative AI embeddings"

  depends_on = [aws_opensearchserverless_security_policy.encryption]
}

resource "aws_opensearchserverless_security_policy" "encryption" {
  name        = "agent-memory-encryption"
  type        = "encryption"
  policy      = jsonencode({
    Rules = [
      {
        ResourceType = "collection"
        Resource = ["collection/gen-ai-agent-memory"]
      }
    ],
    AWSOwnedKey = true
  })
}

output "vector_endpoint" {
  value = aws_opensearchserverless_collection.agent_memory.collection_endpoint
}

This HCL snippet ensures that encryption is enabled by default—a non-negotiable requirement for enterprise AI apps handling proprietary data.
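Encryption alone does not make the collection reachable: OpenSearch Serverless also expects a network policy and a data access policy. The sketch below shows one plausible shape for both. It assumes a VPC endpoint ID supplied through a hypothetical variable and an agent task role named aws_iam_role.agent_task_role, neither of which appears in the original; adjust the permission list to what your agent actually does.

```hcl
# Hypothetical input: ID of an OpenSearch Serverless VPC endpoint.
variable "opensearch_vpce_id" {
  type = string
}

# Network policy: reachable only through the VPC endpoint, not publicly.
resource "aws_opensearchserverless_security_policy" "network" {
  name = "agent-memory-network"
  type = "network"

  policy = jsonencode([
    {
      Rules = [
        {
          ResourceType = "collection"
          Resource     = ["collection/gen-ai-agent-memory"]
        }
      ]
      AllowFromPublic = false
      SourceVPCEs     = [var.opensearch_vpce_id]
    }
  ])
}

# Data access policy: which principal may read and write vector indexes.
resource "aws_opensearchserverless_access_policy" "agent_data" {
  name = "agent-memory-data"
  type = "data"

  policy = jsonencode([
    {
      Rules = [
        {
          ResourceType = "index"
          Resource     = ["index/gen-ai-agent-memory/*"]
          Permission = [
            "aoss:CreateIndex",
            "aoss:DescribeIndex",
            "aoss:ReadDocument",
            "aoss:WriteDocument"
          ]
        }
      ]
      # Assumed role used by the agent's ECS task (not defined in this guide).
      Principal = [aws_iam_role.agent_task_role.arn]
    }
  ])
}
```

The principal also needs the aoss:APIAccessAll IAM permission on the collection; the data access policy by itself is not sufficient.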

2. Securing LLM Credentials

Hardcoding API keys is a cardinal sin in DevOps, but in GenAI, it’s also a financial risk due to usage-based billing. We leverage AWS Secrets Manager to inject keys into our agent’s environment at runtime.

resource "aws_secretsmanager_secret" "openai_api_key" {
  name        = "production/gen-ai/openai-key"
  description = "API Key for OpenAI Model Access"
}

resource "aws_iam_role_policy" "ecs_task_secrets" {
  name = "ecs-task-secrets-access"
  role = aws_iam_role.ecs_task_execution_role.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "secretsmanager:GetSecretValue"
        Effect = "Allow"
        Resource = aws_secretsmanager_secret.openai_api_key.arn
      }
    ]
  })
}

By explicitly defining the IAM policy, we adhere to the principle of least privilege. The ECS task execution role, which fetches the secret on the container’s behalf at startup, can read only the specific secret required for inference.

3. Deploying the Agent Runtime (ECS Fargate)

For agents that require long-running processes (e.g., maintaining WebSocket connections or processing large documents), AWS Lambda often hits timeout limits. ECS Fargate provides a serverless container environment perfect for hosting Python-based LangChain agents.

resource "aws_ecs_task_definition" "agent_task" {
  family                   = "gen-ai-agent"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = 1024
  memory                   = 2048
  execution_role_arn       = aws_iam_role.ecs_task_execution_role.arn

  container_definitions = jsonencode([
    {
      name      = "agent_container"
      image     = "${aws_ecr_repository.agent_repo.repository_url}:latest"
      essential = true
      secrets   = [
        {
          name      = "OPENAI_API_KEY"
          valueFrom = aws_secretsmanager_secret.openai_api_key.arn
        }
      ]
      environment = [
        {
          name  = "VECTOR_DB_ENDPOINT"
          value = aws_opensearchserverless_collection.agent_memory.collection_endpoint
        }
      ]
      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-group"         = "/ecs/gen-ai-agent"
          "awslogs-region"        = var.aws_region
          "awslogs-stream-prefix" = "ecs"
        }
      }
    }
  ])
}

This configuration links the output of your vector store resource (created in Step 1) into the container’s environment variables. Terraform tracks this as an implicit dependency, so a change to the collection endpoint produces a new task definition revision on the next apply.
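The awslogs driver in the task definition writes to the /ecs/gen-ai-agent log group, which must exist before the task starts. A small sketch that creates it follows; the 30-day retention is an arbitrary example value.

```hcl
# Log group referenced by the task definition's awslogs configuration.
resource "aws_cloudwatch_log_group" "agent" {
  name              = "/ecs/gen-ai-agent"
  retention_in_days = 30 # example value; tune to your retention policy
}
```

Setting a retention period matters for agents in particular, since prompt and response logs can grow quickly and may contain sensitive data.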

4. Automating the Lifecycle with Terraform & CI/CD

Deploying Generative AI with Terraform isn’t just about the initial setup; it’s about the lifecycle. As models drift and prompts need updating, you need a pipeline that handles redeployment without downtime.

The “Blue/Green” Strategy for AI Agents

AI agents are non-deterministic. A prompt change that works for one query might break another. Implementing a Blue/Green deployment strategy using Terraform is crucial.

Infrastructure (Terraform): Defines the Load Balancer and Target Groups.
Application (CodeDeploy): Shifts traffic from the old agent version (Blue) to the new version (Green) gradually.

Using the AWS CodeDeploy Terraform resource, you can script this traffic shift to automatically rollback if error rates spike (e.g., if the LLM starts hallucinating or timing out).
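A sketch of what that CodeDeploy wiring might look like is shown below. Every referenced resource (the ECS cluster and service, the listener, the two target groups, the CodeDeploy IAM role, and the CloudWatch alarm) is assumed to exist elsewhere under these hypothetical names.

```hcl
resource "aws_codedeploy_app" "agent" {
  compute_platform = "ECS"
  name             = "gen-ai-agent"
}

resource "aws_codedeploy_deployment_group" "agent" {
  app_name              = aws_codedeploy_app.agent.name
  deployment_group_name = "gen-ai-agent-blue-green"
  service_role_arn      = aws_iam_role.codedeploy.arn # assumed role

  # Send 10% of traffic to Green, wait five minutes, then shift the rest.
  deployment_config_name = "CodeDeployDefault.ECSCanary10Percent5Minutes"

  # Roll back when the deployment fails or a monitored alarm fires.
  auto_rollback_configuration {
    enabled = true
    events  = ["DEPLOYMENT_FAILURE", "DEPLOYMENT_STOP_ON_ALARM"]
  }

  alarm_configuration {
    enabled = true
    alarms  = [aws_cloudwatch_metric_alarm.agent_errors.alarm_name] # assumed alarm
  }

  deployment_style {
    deployment_option = "WITH_TRAFFIC_CONTROL"
    deployment_type   = "BLUE_GREEN"
  }

  blue_green_deployment_config {
    deployment_ready_option {
      action_on_timeout = "CONTINUE_DEPLOYMENT"
    }

    terminate_blue_instances_on_deployment_success {
      action                           = "TERMINATE"
      termination_wait_time_in_minutes = 5
    }
  }

  ecs_service {
    cluster_name = aws_ecs_cluster.main.name  # assumed cluster
    service_name = aws_ecs_service.agent.name # assumed service
  }

  load_balancer_info {
    target_group_pair_info {
      prod_traffic_route {
        listener_arns = [aws_lb_listener.agent.arn] # assumed listener
      }

      target_group {
        name = aws_lb_target_group.blue.name
      }

      target_group {
        name = aws_lb_target_group.green.name
      }
    }
  }
}
```

For this to work, the ECS service has to declare deployment_controller { type = "CODE_DEPLOY" }. The alarm is where the “error rates spike” condition lives, for example a metric on 5xx responses or agent timeouts.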

Frequently Asked Questions (FAQ)

Can Terraform manage the actual LLM models?

Generally, no. Terraform is for infrastructure. While you can use Terraform to provision an Amazon SageMaker Endpoint or an EC2 instance with GPU support, the model weights themselves (the artifacts) are better managed by tools like DVC or MLflow. Terraform sets the stage; the ML pipeline puts the actors on it.

How do I handle GPU provisioning for self-hosted LLMs in Terraform?

If you are hosting open-source models (like Llama 3 or Mistral), you will need to specify instance types with GPU acceleration. In the aws_instance or aws_launch_template resource, select an appropriate instance type (e.g., g5.2xlarge or p3.2xlarge) and use an AMI (Amazon Machine Image) that ships with GPU drivers preinstalled, such as the AWS Deep Learning AMI.

Is Terraform suitable for prompt management?

No. Prompts are application code/configuration, not infrastructure. Storing prompts in Terraform variables creates unnecessary friction. Store prompts in a dedicated database or as config files within your application repository.

Conclusion

Deploying Generative AI with Terraform transforms a fragile experiment into a resilient enterprise asset. By codifying the vector storage, compute environment, and security policies, you eliminate the “it works on my machine” syndrome that plagues AI development.

The code snippets provided above offer a foundational skeleton. As you scale, look into modularizing these resources into reusable Terraform Modules to empower your data science teams to spin up compliant environments on demand.

AWS, Terraform

Mastering AWS Account Deployment: Terraform & AWS Control Tower

12/29/2025 HuuPV

For modern enterprises, AWS account deployment is no longer a manual task of clicking through the AWS Organizations console. As infrastructure scales, the need for consistent, compliant, and automated “vending machines” for AWS accounts becomes paramount. By combining the governance power of AWS Control Tower with the Infrastructure as Code (IaC) flexibility of Terraform, SREs and Cloud Architects can build a robust deployment pipeline that satisfies both developer velocity and security requirements.

The Foundations: Why Control Tower & Terraform?

In a decentralized cloud environment, AWS account deployment must address three critical pillars: Governance, Security, and Scalability. While AWS Control Tower provides the managed “Landing Zone” environment, Terraform provides the declarative state management required to manage thousands of resources across multiple accounts without configuration drift.

Advanced Concept: Control Tower uses “Guardrails” (Service Control Policies and Config Rules). When deploying accounts via Terraform, you aren’t just creating a container; you are attaching a policy-driven ecosystem that inherits the root organization’s security posture by default.

By leveraging the Terraform AWS Provider alongside Control Tower, you enable a “GitOps” workflow where an account request is simply a .tf file in a repository. This approach ensures that every account is born with the correct IAM roles, VPC configurations, and logging buckets pre-provisioned.

Deep Dive: Account Factory for Terraform (AFT)

The AWS Control Tower Account Factory for Terraform (AFT) is the official bridge between these two worlds. AFT sets up a separate orchestration engine that listens for Terraform changes and triggers the Control Tower account creation API.

The AFT Component Stack

AFT Management Account: A dedicated account within your Organization to host the AFT pipeline.
Request Metadata: A DynamoDB table or Git repo that stores account parameters (Email, OU, SSO user).
Customization Pipeline: A series of Step Functions and Lambda functions that apply “Global” and “Account-level” Terraform modules after the account is provisioned.
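To make the component stack more concrete, the AFT framework itself is deployed from the Control Tower management account with a single module call. The sketch below uses placeholder account IDs, regions, and repository names, and shows only a subset of the module’s inputs.

```hcl
module "aft" {
  source = "github.com/aws-ia/terraform-aws-control_tower_account_factory"

  # Placeholder account IDs; replace with the IDs from your landing zone.
  ct_management_account_id  = "111111111111"
  log_archive_account_id    = "222222222222"
  audit_account_id          = "333333333333"
  aft_management_account_id = "444444444444"

  # Region where Control Tower is deployed, plus a secondary region
  # for the replicated Terraform backend.
  ct_home_region              = "us-east-1"
  tf_backend_secondary_region = "us-west-2"

  # Git repositories that feed the pipeline (hypothetical names).
  vcs_provider                                  = "github"
  account_request_repo_name                     = "example-org/aft-account-request"
  global_customizations_repo_name               = "example-org/aft-global-customizations"
  account_customizations_repo_name              = "example-org/aft-account-customizations"
  account_provisioning_customizations_repo_name = "example-org/aft-account-provisioning-customizations"
}
```

The four repositories map onto the stack above: account requests land in the first, while the remaining three hold the global, per-account, and provisioning-time customizations.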

Step-by-Step: Deploying Your First Managed Account

To master AWS account deployment via AFT, you must understand the structure of an account request. Below is a production-grade example of a Terraform module call to request a new “Production” account.


module "production_app_01" {
  # Account requests live in your AFT account request repository and use its local module
  source = "./modules/aft-account-request"

  control_tower_parameters = {
    AccountEmail              = "cloud-ops+prod-app-01@example.com"
    AccountName               = "production-app-01"
    ManagedOrganizationalUnit = "Production"
    SSOUserEmail              = "admin@example.com"
    SSOUserFirstName          = "Platform"
    SSOUserLastName           = "Team"
  }

  account_tags = {
    "Project"     = "Apollo"
    "Environment" = "Production"
    "CostCenter"  = "12345"
  }

  change_management_parameters = {
    change_requested_by = "DevOps Team"
    change_reason       = "New microservice deployment for Q4"
  }

  custom_fields = {
    vpc_cidr = "10.0.0.0/20"
  }
}

After applying this Terraform code, AFT triggers a workflow in the background. It calls the Service Catalog ProvisionProduct API for the Control Tower Account Factory product, waits for the account to be “Ready,” and then executes your post-provisioning Terraform modules to set up VPCs, IAM roles, and CloudWatch alarms.

Production-Ready Best Practices

Expert SREs know that AWS account deployment is only 20% of the battle; the other 80% is maintaining those accounts. Follow these standards:

Idempotency is King: Ensure your post-provisioning scripts can run multiple times without failure. Use Terraform’s lifecycle { prevent_destroy = true } on critical resources like S3 logging buckets.
Service Quota Management: Newly deployed accounts start with default limits. Use the aws_servicequotas_service_quota resource to automatically request increases for EC2 instances or VPCs during the deployment phase.
Region Deny Policies: Use Control Tower guardrails to restrict deployments to approved regions. This reduces your attack surface and prevents “shadow IT” in unmonitored regions like me-south-1.
Centralized Logging: Always ensure the aws_s3_bucket_policy in your log-archive account allows the CloudTrail service principal (cloudtrail.amazonaws.com) to write logs under the newly created account’s prefix immediately.
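Two of these practices translate directly into account-level customization code. The sketch below is illustrative: the bucket name, the account_id variable, and the requested quota value are placeholders.

```hcl
# Hypothetical input: ID of the account being customized.
variable "account_id" {
  type = string
}

# Critical bucket protected from accidental destruction.
resource "aws_s3_bucket" "access_logs" {
  bucket = "example-org-access-logs-${var.account_id}"

  lifecycle {
    prevent_destroy = true
  }
}

# Request a higher "VPCs per Region" quota during account customization.
resource "aws_servicequotas_service_quota" "vpcs_per_region" {
  service_code = "vpc"
  quota_code   = "L-F678F1CE" # VPCs per Region
  value        = 10           # example value above the default
}
```

Quota increases are requests, not guarantees: AWS may approve them asynchronously, so later steps should not assume the new limit is available right away.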

Troubleshooting Common Deployment Failures

Even with automation, AWS account deployment can encounter hurdles. Here are the most common failure modes observed in enterprise environments:

Issue	Root Cause	Resolution
Email Already in Use	AWS account emails must be globally unique across all of AWS.	Use email sub-addressing (e.g., `ops+acc1@company.com`) if supported by your provider.
STS Timeout	AFT cannot assume the `AWSControlTowerExecution` role in the new account.	Check if a Service Control Policy (SCP) is blocking `sts:AssumeRole` in the target OU.
Customization Loop	Terraform state mismatch in the AFT pipeline.	Manually clear the DynamoDB lock table for the specific account ID in the AFT Management account.

Frequently Asked Questions

Can I use Terraform to deploy accounts without Control Tower?

Yes, using the aws_organizations_account resource. However, you lose the managed guardrails and automated dashboarding provided by Control Tower. For expert-level setups, Control Tower + AFT is the industry standard for compliance.
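For comparison, a bare account created without Control Tower might look like the sketch below. The OU variable is a placeholder, and none of the Control Tower controls are applied by this resource.

```hcl
# Hypothetical input: ID of the target Organizational Unit.
variable "production_ou_id" {
  type = string
}

resource "aws_organizations_account" "app" {
  name      = "production-app-01"
  email     = "cloud-ops+prod-app-01@example.com"
  parent_id = var.production_ou_id

  # IAM role created in the new account for cross-account administration.
  role_name = "OrganizationAccountAccessRole"

  # Removing the resource from Terraform will not close the account.
  close_on_deletion = false

  lifecycle {
    # role_name cannot be read back from the API after creation.
    ignore_changes = [role_name]
  }
}
```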

How does AFT handle Terraform state?

AFT manages state files in an S3 bucket within the AFT Management account. It creates a unique state key for each account it provisions to ensure isolation and prevent blast-radius issues during updates.

How long does a typical AWS account deployment take via AFT?

Usually between 20 and 45 minutes. This includes the time AWS takes to provision the account itself, apply Control Tower guardrails, and run your custom Terraform modules.

Conclusion

Mastering AWS account deployment requires a shift from manual administration to a software engineering mindset. By treating your accounts as immutable infrastructure and managing them through Terraform and AWS Control Tower, you gain the ability to scale your cloud footprint with confidence. Whether you are managing five accounts or five thousand, the combination of AFT and IaC provides the consistency and auditability required by modern regulatory frameworks. For further technical details, refer to the Official AFT Documentation.

Terraform

Terraform Secrets: Deploy Your Terraform Workers Like a Pro

12/26/2025 HuuPV

If you are reading this, you’ve likely moved past the “Hello World” stage of Infrastructure as Code. You aren’t just spinning up a single EC2 instance; you are orchestrating fleets. Whether you are managing high-throughput Celery nodes, Kubernetes worker pools, or self-hosted Terraform Workers (Terraform Cloud Agents), the game changes at scale.

In this guide, we dive deep into the architecture of deploying resilient, immutable worker nodes. We will move beyond basic resource blocks and explore lifecycle management, drift detection strategies, and the “cattle not pets” philosophy that distinguishes a Junior SysAdmin from a Staff Engineer.

The Philosophy of Immutable Terraform Workers

When we talk about Terraform Workers in an expert context, we are usually discussing compute resources that perform background processing. The biggest mistake I see in production environments is treating these workers as mutable infrastructure—servers that are patched, updated, and nursed back to health.

To deploy workers like a pro, you must embrace Immutability. Your Terraform configuration should not describe changes to a worker; it should describe the replacement of a worker.

Pro-Tip: Stop using remote-exec provisioners to configure your workers. They introduce brittleness and make your terraform apply dependent on SSH connectivity and runtime package repositories. Instead, shift left. Use HashiCorp Packer to bake your dependencies into a Golden Image, and use Terraform solely for orchestration.

Architecting Resilient Worker Fleets

Let’s look at the actual HCL required to deploy a robust fleet of workers. We aren’t just using aws_instance; we are using Launch Templates and Auto Scaling Groups (ASGs) to ensure self-healing capabilities.

1. The Golden Image Strategy

Your Terraform Workers should boot instantly. If your user_data script takes 15 minutes to install Python dependencies, your autoscaling events will be too slow to handle traffic spikes.

data "aws_ami" "worker_golden_image" {
  most_recent = true
  owners      = ["self"]

  filter {
    name   = "name"
    values = ["my-worker-image-v*"]
  }

  filter {
    name   = "tag:Status"
    values = ["production"]
  }
}
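
As a rough sketch of where such an image could come from, the Packer template below bakes dependencies into an AMI whose name and Status tag match the filters above. The region, base Ubuntu filter, build instance type, and install commands are assumptions for illustration, not values from this setup.

```hcl
packer {
  required_plugins {
    amazon = {
      version = ">= 1.2.0"
      source  = "github.com/hashicorp/amazon"
    }
  }
}

source "amazon-ebs" "worker" {
  region        = "us-east-1" # assumed region
  instance_type = "t3.medium" # assumed build instance size
  ssh_username  = "ubuntu"

  # Timestamped name that matches the "my-worker-image-v*" filter
  ami_name = "my-worker-image-v${formatdate("YYYYMMDDhhmmss", timestamp())}"

  source_ami_filter {
    filters = {
      name                = "ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"
      root-device-type    = "ebs"
      virtualization-type = "hvm"
    }
    most_recent = true
    owners      = ["099720109477"] # Canonical
  }

  # Matches the tag:Status filter in the data source
  tags = {
    Status = "production"
  }
}

build {
  sources = ["source.amazon-ebs.worker"]

  # Placeholder dependency install; replace with your worker's real requirements
  provisioner "shell" {
    inline = [
      "sudo apt-get update",
      "sudo apt-get install -y python3-pip",
    ]
  }
}
```

In practice you may prefer to tag new images with a non-production status first and promote them after validation, so the data source never picks up an untested build.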

2. Zero-Downtime Rotation with Lifecycle Blocks

One of the most powerful yet underutilized features for managing workers is the lifecycle meta-argument. By default, when a change forces a resource to be replaced, Terraform destroys the old resource before creating the new one, which for a worker fleet means a window with no capacity.

To avoid killing active jobs, use create_before_destroy within your resource definitions. Terraform then creates the replacement fleet and waits for it to reach capacity before the old one is destroyed.

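resource "aws_autoscaling_group" "worker_fleet" {
  # Embedding the template version in the name forces a replacement ASG
  # (blue/green) whenever the Launch Template changes.
  name                = "worker-asg-${aws_launch_template.worker.latest_version}"
  min_size            = 3
  max_size            = 10
  vpc_zone_identifier = module.vpc.private_subnets

  launch_template {
    id = aws_launch_template.worker.id
    # Reference the concrete version: "$Latest" never changes in the plan,
    # so it would not trigger an instance refresh.
    version = aws_launch_template.worker.latest_version
  }

  # Covers in-place changes that do not replace the ASG
  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 90
    }
  }

  lifecycle {
    create_before_destroy = true
    # Attachments are managed outside this resource
    ignore_changes = [load_balancers, target_group_arns]
  }
}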
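
The ASG above refers to aws_launch_template.worker, which is not shown. A minimal sketch of what it might look like, assuming the golden image data source from step 1 and an arbitrary instance type:

```hcl
resource "aws_launch_template" "worker" {
  name_prefix   = "worker-"
  image_id      = data.aws_ami.worker_golden_image.id
  instance_type = "c6i.large" # assumed instance type

  # Each change creates a new template version and marks it as the default
  update_default_version = true

  lifecycle {
    create_before_destroy = true
  }
}
```

Because the image ID comes from the data source, publishing a new golden image produces a new template version on the next plan, which in turn drives the fleet rotation.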

Specific Use Case: Terraform Cloud Agents (Self-Hosted Workers)

Sometimes, “Terraform Workers” refers specifically to Terraform Cloud Agents. These are specialized workers you deploy in your own private network to execute Terraform runs on behalf of Terraform Cloud (TFC) or Terraform Enterprise (TFE). This allows TFC to manage resources behind your corporate firewall without whitelisting public IPs.

Security & Isolation

When deploying TFC Agents, security is paramount. These workers hold the “Keys to the Kingdom”—they need broad IAM permissions to provision infrastructure.

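Network Isolation: Deploy these workers in private subnets with no ingress access, only egress (443) to app.terraform.io and the related HashiCorp endpoints the agent needs (registry.terraform.io, releases.hashicorp.com, and archivist.terraform.io).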
Ephemeral Tokens: Do not hardcode the TFC Agent Token. Inject it via a secrets manager (like AWS Secrets Manager or HashiCorp Vault) at runtime.
Single-Use Agents: For maximum security, configure your agents to terminate after a single job (if your architecture supports high churn) to prevent credential caching attacks.
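
The network isolation point can be sketched as a security group with no ingress rules and HTTPS-only egress. The resource name and var.vpc_id are hypothetical. Security groups cannot filter by hostname, so limiting traffic to specific HashiCorp endpoints would need an egress proxy or a network firewall in addition to this.

```hcl
resource "aws_security_group" "tfc_agent" {
  name_prefix = "tfc-agent-"
  description = "TFC agents: no inbound, HTTPS outbound only"
  vpc_id      = var.vpc_id # hypothetical variable

  # No ingress blocks: nothing can initiate a connection to the agent.

  egress {
    description = "HTTPS to Terraform Cloud and provider APIs"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```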

# Example: Fetching the TFC Agent Token from Secrets Manager at boot (never hardcoded)
resource "aws_launch_template" "tfc_agent" {
  name_prefix   = "tfc-agent-"
  image_id      = data.aws_ami.ubuntu.id # Assumes Docker and the AWS CLI are baked into the image
  instance_type = "t3.medium"

  user_data = base64encode(<<-EOF
              #!/bin/bash
              # Fetch token from Secrets Manager (requires an instance profile that allows
              # secretsmanager:GetSecretValue; the CLI also needs a region, e.g. AWS_DEFAULT_REGION)
              export TFC_AGENT_TOKEN=$(aws secretsmanager get-secret-value --secret-id tfc-agent-token --query SecretString --output text)
              
              # Start the agent container. "-e TFC_AGENT_TOKEN" without a value forwards the
              # exported variable, keeping the token out of the command line.
              docker run -d --restart always \
                --name tfc-agent \
                -e TFC_AGENT_TOKEN \
                -e TFC_AGENT_NAME="worker-$(hostname)" \
                hashicorp/tfc-agent:latest
              EOF
  )
}

Advanced Troubleshooting & Drift Detection

Even the best-architected Terraform Workers can experience drift. This happens when a process on the worker changes a configuration file, or a manual intervention occurs.

Detecting “Zombie” Workers

A common failure mode is a worker that passes the EC2 status check but fails the application health check. Terraform only sees what the cloud provider API reports, so it treats such a worker as healthy.

The Solution: Decouple your health checks. Use Terraform to provision the infrastructure, but rely on the Auto Scaling Group’s health_check_type = "ELB" (if using Load Balancers) or custom CloudWatch alarms to terminate unhealthy instances. Terraform’s job is to define the state of the fleet, not monitor the health of the application process inside it.
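
A minimal sketch of that setting, assuming the workers sit behind a target group whose ARN is passed in as the hypothetical var.worker_target_group_arn. The resource name and the 300-second grace period are illustrative.

```hcl
resource "aws_autoscaling_group" "worker_fleet_elb_checked" {
  name_prefix         = "worker-asg-"
  min_size            = 3
  max_size            = 10
  vpc_zone_identifier = module.vpc.private_subnets
  target_group_arns   = [var.worker_target_group_arn] # hypothetical variable

  # Replace instances that fail the load balancer's application health check,
  # not only those that fail the EC2 status check.
  health_check_type         = "ELB"
  health_check_grace_period = 300 # seconds allowed for boot before checks count

  launch_template {
    id      = aws_launch_template.worker.id
    version = aws_launch_template.worker.latest_version
  }
}
```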

Frequently Asked Questions (FAQ)

1. Should I use Terraform `count` or `for_each` for worker nodes?

For identical worker nodes (like an ASG), you generally shouldn’t use either—you should use an Autoscaling Group resource which handles the count dynamically. However, if you are deploying distinct workers (e.g., “Worker-High-CPU” vs “Worker-High-Mem”), use for_each. It allows you to add or remove specific workers without shifting the index of all other resources, which happens with count.
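
For the distinct-worker case, a for_each sketch might look like the following. The variable name, pool names, and instance types are made up for illustration; the AMI reuses the golden image data source from earlier.

```hcl
variable "worker_pools" {
  description = "Distinct worker definitions keyed by a stable name"
  type = map(object({
    instance_type = string
  }))
  default = {
    "high-cpu" = { instance_type = "c6i.2xlarge" }
    "high-mem" = { instance_type = "r6i.2xlarge" }
  }
}

resource "aws_instance" "worker" {
  for_each = var.worker_pools

  ami           = data.aws_ami.worker_golden_image.id
  instance_type = each.value.instance_type

  tags = {
    Name = "worker-${each.key}"
  }
}
```

Resources are addressed by key (for example aws_instance.worker["high-cpu"]), so removing one entry from the map leaves the others untouched.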

2. How do I handle secrets on my Terraform Workers?

Never hardcode secrets in your Terraform code, and keep them out of state wherever possible, since values passed through resource arguments are stored in the state file in plain text. Use IAM Roles (Instance Profiles) attached to the workers. The code running on the worker should use the AWS SDK (or equivalent) to fetch secrets from a managed service like AWS Secrets Manager or Vault at runtime.
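
A sketch of that pattern, assuming the secret already exists and its ARN is supplied through the hypothetical var.worker_secret_arn. All resource names are illustrative.

```hcl
# Let EC2 instances assume the role
data "aws_iam_policy_document" "worker_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "worker" {
  name_prefix        = "worker-"
  assume_role_policy = data.aws_iam_policy_document.worker_assume.json
}

# Read access to one secret only
data "aws_iam_policy_document" "worker_read_secret" {
  statement {
    actions   = ["secretsmanager:GetSecretValue"]
    resources = [var.worker_secret_arn] # hypothetical variable
  }
}

resource "aws_iam_role_policy" "worker_read_secret" {
  name   = "read-worker-secret"
  role   = aws_iam_role.worker.id
  policy = data.aws_iam_policy_document.worker_read_secret.json
}

# Attach this profile to the launch template or instance
resource "aws_iam_instance_profile" "worker" {
  name_prefix = "worker-"
  role        = aws_iam_role.worker.name
}
```

Only the secret's ARN appears in code and state; the secret value itself is fetched by the worker at runtime.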

3. What is the difference between Terraform Workers and Cloudflare Workers?

This is a common confusion. Terraform Workers (in this context) are compute instances managed by Terraform. Cloudflare Workers are a serverless execution environment provided by Cloudflare. Interestingly, you can use the cloudflare Terraform provider to manage Cloudflare Workers, treating the serverless code itself as an infrastructure resource!

Conclusion

Deploying Terraform Workers effectively requires a shift in mindset from “managing servers” to “managing fleets.” By leveraging Golden Images, using lifecycle rules and rolling instance refreshes on your ASGs, and securing your TFC Agents, you elevate your infrastructure from fragile to anti-fragile.

Remember, the goal of an expert DevOps engineer isn’t just to write code that works; it’s to write code that scales, heals, and protects itself. Thank you for reading the DevopsRoles page!

Terraform

Terraform AWS IAM: Simplify Policy Management Now

12/16/2025 HuuPV

For expert DevOps engineers and SREs, managing Identity and Access Management (IAM) at scale is rarely about clicking buttons in the AWS Console. It is about architectural purity, auditability, and the Principle of Least Privilege. When implemented correctly, Terraform AWS IAM management transforms a potential security swamp into a precise, version-controlled fortress.

However, as infrastructure grows, so does the complexity of JSON policy documents, cross-account trust relationships, and conditional logic. This guide moves beyond the basics of resource "aws_iam_user" and dives into advanced patterns for constructing scalable, maintainable, and secure IAM hierarchies using HashiCorp Terraform.

The Evolution from Raw JSON to HCL Data Sources

In the early days of Terraform, engineers often embedded raw JSON strings into their aws_iam_policy resources using Heredoc syntax. While functional, this approach is brittle. It lacks syntax validation during the terraform plan phase and makes dynamic interpolation painful.

The expert standard today relies heavily on the aws_iam_policy_document data source. This allows you to write policies in HCL (HashiCorp Configuration Language), so you can use Terraform’s native logic capabilities like dynamic blocks and conditionals.

Why aws_iam_policy_document is Superior

Validation: Terraform validates HCL syntax before the API call is made.
Composability: You can merge multiple data sources using the source_policy_documents or override_policy_documents arguments, allowing for modular policy construction.
Readability: It abstracts the JSON formatting, letting you focus on the logic.

Advanced Example: Dynamic Conditions and Merging

data "aws_iam_policy_document" "base_deny" {
  # No principals block here: this document is merged into an identity-based
  # policy (aws_iam_policy), and IAM rejects a Principal element in those.
  statement {
    sid       = "DenyNonSecureTransport"
    effect    = "Deny"
    actions   = ["s3:*"]
    resources = ["arn:aws:s3:::*"]

    condition {
      test     = "Bool"
      variable = "aws:SecureTransport"
      values   = ["false"]
    }
  }
}

data "aws_iam_policy_document" "s3_read_only" {
  # Merge the base deny policy into this specific policy
  source_policy_documents = [data.aws_iam_policy_document.base_deny.json]

  statement {
    sid       = "AllowS3List"
    effect    = "Allow"
    actions   = ["s3:ListBucket", "s3:GetObject"]
    resources = [
      var.s3_bucket_arn,
      "${var.s3_bucket_arn}/*"
    ]
  }
}

resource "aws_iam_policy" "secure_read_only" {
  name   = "secure-s3-read-only"
  policy = data.aws_iam_policy_document.s3_read_only.json
}

Pro-Tip: Use override_policy_documents sparingly. While powerful for hot-fixing policies in downstream modules, it can obscure the final policy outcome, making debugging permissions difficult. Prefer source_policy_documents for additive composition.
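
The dynamic blocks mentioned earlier can be sketched as follows: one statement is generated per bucket in a map. The variable and data source names are hypothetical.

```hcl
variable "read_only_bucket_arns" {
  description = "Buckets to grant read access to, keyed by a short label"
  type        = map(string)
  default     = {}
}

data "aws_iam_policy_document" "multi_bucket_read" {
  # Emits one Allow statement per entry in the map
  dynamic "statement" {
    for_each = var.read_only_bucket_arns

    content {
      effect  = "Allow"
      actions = ["s3:ListBucket", "s3:GetObject"]
      resources = [
        statement.value,
        "${statement.value}/*",
      ]
    }
  }
}
```

With an empty map the document contains no statements, so a policy built from it should only be created when at least one bucket is supplied.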

Mastering Trust Policies (Assume Role)

One of the most common friction points in Terraform AWS IAM is the “Assume Role Policy” (or Trust Policy). Unlike standard permission policies, this defines who can assume the role.

Hardcoding principals in JSON is a mistake when working with dynamic environments (e.g., ephemeral EKS clusters). Instead, leverage the aws_iam_policy_document for trust relationships as well.

Pattern: IRSA (IAM Roles for Service Accounts)

When working with Kubernetes (EKS), you often need to construct OIDC trust relationships. This requires precise string manipulation to match the OIDC provider URL and the specific Service Account namespace/name.

data "aws_iam_policy_document" "eks_oidc_assume" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [var.oidc_provider_arn]
    }

    condition {
      test     = "StringEquals"
      variable = "${replace(var.oidc_provider_url, "https://", "")}:sub"
      values   = ["system:serviceaccount:${var.namespace}:${var.service_account_name}"]
    }
  }
}

resource "aws_iam_role" "app_role" {
  name               = "eks-app-role"
  assume_role_policy = data.aws_iam_policy_document.eks_oidc_assume.json
}
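
A common hardening step, offered here as an assumption rather than part of the pattern above, is to also pin the token audience. The sketch reuses the same variables and adds a second condition on the :aud claim.

```hcl
data "aws_iam_policy_document" "eks_oidc_assume_strict" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [var.oidc_provider_arn]
    }

    # Only this service account in this namespace
    condition {
      test     = "StringEquals"
      variable = "${replace(var.oidc_provider_url, "https://", "")}:sub"
      values   = ["system:serviceaccount:${var.namespace}:${var.service_account_name}"]
    }

    # Only tokens issued for AWS STS
    condition {
      test     = "StringEquals"
      variable = "${replace(var.oidc_provider_url, "https://", "")}:aud"
      values   = ["sts.amazonaws.com"]
    }
  }
}
```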

Handling Circular Dependencies

A classic deadlock occurs when you try to create an IAM Role that needs to be referenced in a Policy, which is then attached to that Role. Terraform’s graph dependency engine usually handles this well, but edge cases exist, particularly with S3 Bucket Policies referencing specific Roles.

To resolve this, rely on aws_iam_role.name or aws_iam_role.arn strictly where needed. If a circular dependency arises (e.g., KMS Key Policy referencing a Role that needs the Key ARN), you may need to break the cycle by using a separate aws_iam_role_policy_attachment resource rather than inline policies, or by using data sources to look up ARNs if the resources are loosely coupled.
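
One way to picture the separate-attachment approach: the role, the policy, and the link between them are three resources, so neither the role nor the policy has to reference the other. Names, the Lambda trust policy, and var.kms_key_arn are illustrative assumptions.

```hcl
data "aws_iam_policy_document" "app_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["lambda.amazonaws.com"] # assumed trusted service
    }
  }
}

# 1. The role knows nothing about its permissions
resource "aws_iam_role" "app" {
  name               = "app-role"
  assume_role_policy = data.aws_iam_policy_document.app_assume.json
}

data "aws_iam_policy_document" "app_kms" {
  statement {
    actions   = ["kms:Decrypt"]
    resources = [var.kms_key_arn] # hypothetical variable
  }
}

# 2. The policy knows nothing about the role
resource "aws_iam_policy" "app_kms" {
  name   = "app-kms-decrypt"
  policy = data.aws_iam_policy_document.app_kms.json
}

# 3. Only the attachment depends on both
resource "aws_iam_role_policy_attachment" "app_kms" {
  role       = aws_iam_role.app.name
  policy_arn = aws_iam_policy.app_kms.arn
}
```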

Scaling with Modules: The “Terraform AWS IAM” Ecosystem

Writing every policy from scratch violates DRY (Don’t Repeat Yourself). For enterprise-grade implementations, the Community AWS IAM Module is the gold standard.

It abstracts complex logic for creating IAM users, groups, and assumable roles. However, for highly specific internal platforms, building a custom internal module is often better.

When to Build vs. Buy (Use Community Module)

Scenario	Recommendation	Reasoning
Standard Roles (EC2, Lambda)	Community Module	Handles standard trust policies and common attachments instantly.
Complex IAM Users	Community Module	Simplifies PGP key encryption for secret keys and login profiles.
Strict Compliance (PCI/HIPAA)	Custom Module	Allows strict enforcement of Permission Boundaries and naming conventions hardcoded into the module logic.

Best Practices for Security & Compliance

1. Enforce Permission Boundaries

Delegating IAM creation to developer teams is risky. Permission Boundaries are the most reliable guardrail for letting teams create roles safely, because a boundary caps the maximum permissions a role can have regardless of the policies attached to it. In Terraform, ensure your module accepts a permissions_boundary_arn variable and applies it to every role created.
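
Inside a module, that could be sketched like this. The permissions_boundary_arn variable comes from the text above; role_name, trust_policy_json, and the resource name are hypothetical.

```hcl
variable "permissions_boundary_arn" {
  description = "ARN of the managed policy used as the boundary for every role"
  type        = string
}

variable "role_name" {
  type = string
}

variable "trust_policy_json" {
  description = "Rendered assume-role policy document"
  type        = string
}

resource "aws_iam_role" "delegated" {
  name               = var.role_name
  assume_role_policy = var.trust_policy_json

  # The boundary is not optional: callers cannot create a role without it
  permissions_boundary = var.permissions_boundary_arn
}
```

Leaving the variable without a default means a caller who forgets the boundary gets a plan-time error instead of an unbounded role.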

2. Lock Down with `terraform-compliance` or OPA

Before Terraform applies, your CI/CD pipeline should scan the plan. Tools like terraform-compliance, Open Policy Agent (OPA), or Sentinel can block Effect: Allow on Action: "*".

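# Example Rego policy (OPA) to deny wildcard actions in the JSON plan
package terraform.iam

import rego.v1

deny contains msg if {
  resource := input.resource_changes[_]
  resource.type == "aws_iam_policy"
  statement := json.unmarshal(resource.change.after.policy).Statement[_]
  statement.Effect == "Allow"
  wildcard_action(statement.Action)
  msg := sprintf("Wildcard action not allowed in policy: %v", [resource.name])
}

# Action may be rendered as a single string or as a list of strings
wildcard_action(action) if action == "*"
wildcard_action(action) if action[_] == "*"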

Frequently Asked Questions (FAQ)

Can I manage IAM resources across multiple AWS accounts with one Terraform apply?

Technically yes, using multiple provider aliases. However, this is generally an anti-pattern due to the “blast radius” risk. It is better to separate state files by account or environment and use a pipeline to orchestrate updates.

How do I import existing IAM roles into Terraform?

Use the import block (available in Terraform 1.5+) or the CLI command: terraform import aws_iam_role.example role_name. Be careful with attached policies; you must identify if they are inline policies or managed policy attachments and import those separately to avoid state drift.
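
A sketch of the import block form, using the same placeholder address and role name as the CLI command above. The attachment import is an assumed example with a made-up policy ARN.

```hcl
# Adopt the existing role into state on the next apply
import {
  to = aws_iam_role.example
  id = "role_name"
}

# Managed policy attachments are imported separately, as role-name/policy-arn
import {
  to = aws_iam_role_policy_attachment.example_read_only
  id = "role_name/arn:aws:iam::aws:policy/ReadOnlyAccess" # placeholder policy
}
```

Running terraform plan with -generate-config-out=generated.tf can draft the matching resource blocks, which you then review and move into your configuration.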

Inline Policies vs. Managed Policies: Which is better?

Managed Policies (standalone aws_iam_policy resources) are superior. They are reusable, versioned by AWS (allowing rollback), and easier to audit. Inline policies die with the role and can bloat the state file significantly.

Conclusion

Mastering Terraform AWS IAM is about shifting from “making it work” to “making it governable.” By utilizing aws_iam_policy_document for robust HCL definitions, understanding the nuances of OIDC trust relationships, and leveraging modular architectures, you ensure your cloud security scales as fast as your infrastructure.

Start refactoring your legacy JSON Heredoc strings into data sources today to improve readability and future-proof your IAM strategy. Thank you for reading the DevopsRoles page!

AWS, Terraform

Master TimescaleDB Deployment on AWS using Terraform

12/03/2025 HuuPV

Time-series data is the lifeblood of modern observability, IoT, and financial analytics. While managed services exist, enterprise-grade requirements—such as strict data sovereignty, VPC peering latency, or custom ZFS compression tuning—often mandate a self-hosted architecture. This guide focuses on a production-ready TimescaleDB deployment on AWS using Terraform.

We aren’t just spinning up an EC2 instance; we are engineering a storage layer capable of handling massive ingest rates and complex analytical queries. We will leverage Infrastructure as Code (IaC) to orchestrate compute, high-performance block storage, and automated bootstrapping.

Architecture Decisions: Optimizing for Throughput

Before writing HCL, we must define the infrastructure characteristics required by TimescaleDB. Unlike stateless microservices, database performance is bound by I/O and memory.

Compute (EC2): We will target memory-optimized instances (e.g., r6i or r7g families) to maximize the RAM available for PostgreSQL’s shared buffers and OS page cache.
Storage (EBS): We will separate the WAL (Write Ahead Log) from the Data directory.
- WAL Volume: Requires low latency sequential writes. io2 Block Express or high-throughput gp3.
- Data Volume: Requires high random read/write throughput. gp3 is usually sufficient, but striping multiple volumes (RAID 0) is a common pattern for extreme performance.
OS Tuning: We will use cloud-init to tune kernel parameters (hugepages, swappiness) and run timescaledb-tune automatically.

Pro-Tip: Avoid using burstable instances (T-family) for production databases. The CPU credit exhaustion can lead to catastrophic latency spikes during data compaction or high-ingest periods.

Phase 1: Provider & VPC Foundation

Assuming you have a VPC setup, let’s establish the security context. Your TimescaleDB instance should reside in a private subnet, accessible only via a Bastion host or VPN.

Security Group Definition

resource "aws_security_group" "timescale_sg" {
  name        = "timescaledb-sg"
  description = "Security group for TimescaleDB Node"
  vpc_id      = var.vpc_id

  # Inbound: PostgreSQL Standard Port
  ingress {
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [var.app_security_group_id] # Only allow app tier
    description     = "Allow PGSQL access from App Tier"
  }

  # Outbound: Allow package updates and S3 backups
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "timescaledb-production-sg"
  }
}

Phase 2: Storage Engineering (EBS)

This is the critical differentiator for expert deployments. We explicitly define EBS volumes separate from the root device to ensure data persistence independent of the instance lifecycle and to optimize I/O channels.

# Data Volume - Optimized for Throughput
resource "aws_ebs_volume" "pg_data" {
  availability_zone = var.availability_zone
  size              = 500
  type              = "gp3"
  iops              = 12000 # Provisioned IOPS
  throughput        = 500   # MB/s

  tags = {
    Name = "timescaledb-data-vol"
  }
}

# WAL Volume - Optimized for Latency
resource "aws_ebs_volume" "pg_wal" {
  availability_zone = var.availability_zone
  size              = 100
  type              = "io2"
  iops              = 5000 

  tags = {
    Name = "timescaledb-wal-vol"
  }
}

resource "aws_volume_attachment" "pg_data_attach" {
  device_name = "/dev/sdf"
  volume_id   = aws_ebs_volume.pg_data.id
  instance_id = aws_instance.timescale_node.id
}

resource "aws_volume_attachment" "pg_wal_attach" {
  device_name = "/dev/sdg"
  volume_id   = aws_ebs_volume.pg_wal.id
  instance_id = aws_instance.timescale_node.id
}

Phase 3: The TimescaleDB Instance & Bootstrapping

We use the user_data attribute to handle the “Day 0” operations: mounting volumes, installing the TimescaleDB packages (which install PostgreSQL as a dependency), and applying initial configuration tuning.

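Warning: Ensure the IAM Role attached to this instance has permissions for ec2:DescribeTags if you use cloud-init to self-discover volume tags, and S3 access scoped to the backup bucket (for example s3:ListBucket, s3:GetObject, and s3:PutObject) rather than s3:* if you automate WAL-G backups immediately.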

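resource "aws_instance" "timescale_node" {
  ami           = data.aws_ami.ubuntu.id # Recommend Ubuntu 22.04 or 24.04 LTS
  instance_type = "r6i.2xlarge"
  subnet_id     = var.private_subnet_id
  key_name      = var.key_name

  vpc_security_group_ids = [aws_security_group.timescale_sg.id]
  iam_instance_profile   = aws_iam_instance_profile.timescale_role.name

  root_block_device {
    volume_type = "gp3"
    volume_size = 50
  }

  # "Day 0" Configuration Script
  user_data = <<-EOF
    #!/bin/bash
    set -euo pipefail
    
    # 1. Mount EBS Volumes
    # On Nitro instances (such as r6i) EBS volumes appear as NVMe devices, so
    # /dev/sdf and /dev/sdg do not exist. Resolve each device by its volume ID.
    DATA_DEV=/dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_${replace(aws_ebs_volume.pg_data.id, "-", "")}
    WAL_DEV=/dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_${replace(aws_ebs_volume.pg_wal.id, "-", "")}
    # The volumes are attached after the instance launches, so wait for them
    until [ -e "$DATA_DEV" ] && [ -e "$WAL_DEV" ]; do sleep 5; done
    # Create a filesystem only if none exists, so a replacement instance never wipes data
    blkid "$DATA_DEV" || mkfs.xfs "$DATA_DEV"
    blkid "$WAL_DEV" || mkfs.xfs "$WAL_DEV"
    mkdir -p /var/lib/postgresql/data
    mkdir -p /var/lib/postgresql/wal
    mount "$DATA_DEV" /var/lib/postgresql/data
    mount "$WAL_DEV" /var/lib/postgresql/wal
    
    # Persist mounts in fstab... (omitted for brevity)

    # 2. Add the Timescale APT repository & Install
    echo "deb https://packagecloud.io/timescale/timescaledb/ubuntu/ $(lsb_release -c -s) main" > /etc/apt/sources.list.d/timescaledb.list
    wget --quiet -O - https://packagecloud.io/timescale/timescaledb/gpgkey | gpg --dearmor -o /etc/apt/trusted.gpg.d/timescaledb.gpg
    apt-get update
    apt-get install -y timescaledb-2-postgresql-14

    # 3. Initialize Database
    # The package creates a default cluster on the root volume. Replace it with a
    # cluster on the dedicated volumes, registered with postgresql-common so that
    # systemd manages it (a bare initdb would not be started by the service).
    chown -R postgres:postgres /var/lib/postgresql
    pg_dropcluster --stop 14 main
    pg_createcluster 14 main -d /var/lib/postgresql/data -- --waldir=/var/lib/postgresql/wal

    # 4. Tune Configuration
    # This is critical: It calculates memory settings based on the instance type
    timescaledb-tune --quiet --yes --conf-path=/etc/postgresql/14/main/postgresql.conf

    # 5. Enable Service
    systemctl enable postgresql
    pg_ctlcluster 14 main start
  EOF

  tags = {
    Name = "TimescaleDB-Primary"
  }
}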
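
The instance refers to aws_iam_instance_profile.timescale_role, which is not defined in this guide. A least-privilege sketch might look like the following, assuming backups go to a single bucket whose ARN is passed in as the hypothetical var.backup_bucket_arn. Role and policy names are illustrative.

```hcl
data "aws_iam_policy_document" "timescale_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "timescale" {
  name               = "timescaledb-node"
  assume_role_policy = data.aws_iam_policy_document.timescale_assume.json
}

data "aws_iam_policy_document" "timescale_backup" {
  # List the backup bucket
  statement {
    actions   = ["s3:ListBucket"]
    resources = [var.backup_bucket_arn] # hypothetical variable
  }

  # Read, write, and prune backup objects in that bucket only
  statement {
    actions   = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
    resources = ["${var.backup_bucket_arn}/*"]
  }
}

resource "aws_iam_role_policy" "timescale_backup" {
  name   = "timescaledb-backup"
  role   = aws_iam_role.timescale.id
  policy = data.aws_iam_policy_document.timescale_backup.json
}

resource "aws_iam_instance_profile" "timescale_role" {
  name = "timescaledb-node"
  role = aws_iam_role.timescale.name
}
```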

Optimizing Terraform for Stateful Resources

Managing databases with Terraform requires handling state carefully. Unlike a stateless web server, you cannot simply destroy and recreate this resource if you change a parameter.

Lifecycle Management

Use the lifecycle meta-argument to prevent accidental deletion of your primary database node.

lifecycle {
  prevent_destroy = true
  ignore_changes  = [
    ami, 
    user_data # Prevent recreation if boot script changes
  ]
}

Validation and Post-Deployment

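Once terraform apply completes, verification is necessary. You should verify that the TimescaleDB library is preloaded, that the extension can be created, and that your memory settings reflect the timescaledb-tune execution. Note that pg_extension only lists the extension after CREATE EXTENSION has been run in that database.

Connect to your instance and run:

sudo -u postgres psql -c "SHOW shared_preload_libraries;"
sudo -u postgres psql -c "CREATE EXTENSION IF NOT EXISTS timescaledb;"
sudo -u postgres psql -c "SELECT extname, extversion FROM pg_extension WHERE extname = 'timescaledb';"
sudo -u postgres psql -c "SHOW shared_buffers;"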

For further reading on tuning parameters, refer to the official TimescaleDB Tune documentation.

Frequently Asked Questions (FAQ)

1. Can I use RDS for TimescaleDB instead of EC2?

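Not at the time of writing: Amazon RDS for PostgreSQL does not list TimescaleDB among its supported extensions, so check the current RDS extension list before planning around it. If you want a managed option, Timescale offers its own hosted service that runs on AWS. Self-hosting on EC2 also keeps control over low-level filesystem tuning (like using ZFS for compression), which can be critical for high-volume time-series data.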

2. How do I handle High Availability (HA) with this Terraform setup?

This guide covers a single-node deployment. For HA, you would expand the Terraform code to deploy a secondary EC2 instance in a different Availability Zone and configure Streaming Replication. Tools like Patroni are the industry standard for managing auto-failover on self-hosted PostgreSQL/TimescaleDB.

3. Why separate WAL and Data volumes?

WAL operations are sequential and synchronous. If they share bandwidth with random read/write operations of the Data volume, write latency will spike, causing backpressure on your ingestion pipeline. Separating them physically (different EBS volumes) ensures consistent write performance.

Conclusion

Mastering TimescaleDB Deployment on AWS requires moving beyond simple “click-ops” to a codified, reproducible infrastructure. By using Terraform to orchestrate not just the compute, but the specific storage characteristics required for time-series workloads, you ensure your database can scale with your data.

Next Steps: Once your instance is running, implement a backup strategy using WAL-G to stream backups directly to S3, ensuring point-in-time recovery (PITR) capabilities. Thank you for reading the DevopsRoles page!

Terraform

Unlock Reusable VPC Modules: Terraform for Dev/Stage/Prod Environments

11/26/2025 HuuPV

If you are managing infrastructure at scale, you have likely felt the pain of the “copy-paste” sprawl. You define a VPC for Development, then copy the code for Staging, and again for Production, perhaps changing a CIDR block or an instance count manually. This breaks the fundamental DevOps principle of DRY (Don’t Repeat Yourself) and introduces drift risk.

For Senior DevOps Engineers and SREs, the goal isn’t just to write code that works; it’s to architect abstractions that scale. Reusable VPC Modules are the cornerstone of a mature Infrastructure as Code (IaC) strategy. They allow you to define the “Gold Standard” for networking once and instantiate it infinitely across environments with predictable results.

In this guide, we will move beyond basic syntax. We will construct a production-grade, agnostic VPC module capable of dynamic subnet calculation, conditional resource creation (like NAT Gateways), and strict variable validation suitable for high-compliance Dev, Stage, and Prod environments.

Why Reusable VPC Modules Matter (Beyond DRY)

While reducing code duplication is the obvious benefit, the strategic value of modularizing your VPC architecture runs deeper.

Governance & Compliance: By centralizing your network logic, you enforce security standards (e.g., “Flow Logs must always be enabled” or “Private subnets must not have public IP assignment”) in a single location.
Testing & Versioning: You can version your module (e.g., v1.2.0). Production can remain pinned to a stable version while you iterate on features in Development, effectively applying software engineering lifecycles to your network.
Abstraction Complexity: A consumer of your module (perhaps a developer spinning up an ephemeral environment) shouldn’t need to understand Route Tables or NACLs. They should only need to provide a CIDR block and an Environment name.
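
The versioning point can be sketched as a module call pinned to a Git tag. The repository URL and subdirectory are placeholders, not a real location.

```hcl
module "vpc" {
  # Pin Production to a released tag; Development can track a newer ref
  source = "git::https://example.com/your-org/terraform-modules.git//vpc?ref=v1.2.0"

  environment = "prod"
  vpc_cidr    = "10.0.0.0/16"
}
```

Upgrading an environment then becomes a reviewed one-line change to the ref value.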

Pro-Tip: Avoid the “God Module” anti-pattern. While it’s tempting to bundle the VPC, EKS, and RDS into one giant module, this leads to dependency hell. Keep your Reusable VPC Modules strictly focused on networking primitives: VPC, Subnets, Route Tables, Gateways, and ACLs.

Anatomy of a Production-Grade Module

Let’s build a module that calculates subnets dynamically based on Availability Zones (AZs) and handles environment-specific logic (like high availability in Prod vs. cost savings in Dev).

1. Input Strategy & Validation

Modern Terraform (v1.0+) allows for powerful variable validation. We want to ensure that downstream users don’t accidentally pass invalid CIDR blocks.

# modules/vpc/variables.tf

variable "environment" {
  description = "Deployment environment (dev, stage, prod)"
  type        = string
  validation {
    condition     = contains(["dev", "stage", "prod"], var.environment)
    error_message = "Environment must be one of: dev, stage, prod."
  }
}

variable "vpc_cidr" {
  description = "CIDR block for the VPC"
  type        = string
  validation {
    condition     = can(cidrhost(var.vpc_cidr, 0))
    error_message = "Must be a valid IPv4 CIDR block."
  }
}

variable "az_count" {
  description = "Number of AZs to utilize"
  type        = number
  default     = 2
}

2. Dynamic Subnetting with `cidrsubnet`

Hardcoding subnet CIDRs (e.g., 10.0.1.0/24) is brittle. Instead, use the cidrsubnet function to mathematically carve up the VPC CIDR. This ensures no overlap and automatic scalability if you change the base CIDR size.

# modules/vpc/main.tf

data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name = "${var.environment}-vpc"
  }
}

# Public Subnets
resource "aws_subnet" "public" {
  count                   = var.az_count
  vpc_id                  = aws_vpc.main.id
  # Example: 10.0.0.0/16 -> 10.0.0.0/24, 10.0.1.0/24, etc. (count.index starts at 0)
  cidr_block              = cidrsubnet(var.vpc_cidr, 8, count.index)
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true

  tags = {
    Name = "${var.environment}-public-${data.aws_availability_zones.available.names[count.index]}"
    Tier = "Public"
  }
}

# Private Subnets
resource "aws_subnet" "private" {
  count             = var.az_count
  vpc_id            = aws_vpc.main.id
  # Offset the CIDR calculation by 'az_count' to avoid overlap with public subnets
  cidr_block        = cidrsubnet(var.vpc_cidr, 8, count.index + var.az_count)
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = {
    Name = "${var.environment}-private-${data.aws_availability_zones.available.names[count.index]}"
    Tier = "Private"
  }
}
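
To make the subnet math concrete, here is a small standalone sketch that evaluates the same cidrsubnet expressions for a /16 with three AZs. The local and output names are illustrative.

```hcl
locals {
  example_cidr     = "10.0.0.0/16"
  example_az_count = 3

  # Public: indexes 0..2 -> 10.0.0.0/24, 10.0.1.0/24, 10.0.2.0/24
  example_public = [
    for i in range(local.example_az_count) : cidrsubnet(local.example_cidr, 8, i)
  ]

  # Private: indexes 3..5 -> 10.0.3.0/24, 10.0.4.0/24, 10.0.5.0/24
  example_private = [
    for i in range(local.example_az_count) :
    cidrsubnet(local.example_cidr, 8, i + local.example_az_count)
  ]
}

output "example_subnet_plan" {
  value = {
    public  = local.example_public
    private = local.example_private
  }
}
```

One consequence of offsetting by az_count: raising az_count later shifts every private index, so existing private subnets would be planned for replacement. A fixed offset (for example 100) avoids that at the cost of a less compact layout.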

3. Conditional NAT Gateways (Cost Optimization)

NAT Gateways are expensive. In a Dev environment, you might only need one shared NAT Gateway (or none if you use instances with public IPs for testing), whereas Prod requires High Availability (one NAT per AZ).

# modules/vpc/main.tf

locals {
  # If Prod, create NAT per AZ. If Dev/Stage, create only 1 NAT total to save costs.
  nat_gateway_count = var.environment == "prod" ? var.az_count : 1
}

resource "aws_eip" "nat" {
  count  = local.nat_gateway_count
  domain = "vpc"
}

resource "aws_nat_gateway" "main" {
  count         = local.nat_gateway_count
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id

  tags = {
    Name = "${var.environment}-nat-${count.index}"
  }
}
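
The NAT Gateways only take effect once private subnets route through them. A sketch of that wiring, with illustrative resource names: the modulo maps each AZ to its own NAT in prod and every AZ to the single shared NAT otherwise.

```hcl
resource "aws_route_table" "private" {
  count  = var.az_count
  vpc_id = aws_vpc.main.id

  tags = {
    Name = "${var.environment}-private-rt-${count.index}"
  }
}

resource "aws_route" "private_nat" {
  count                  = var.az_count
  route_table_id         = aws_route_table.private[count.index].id
  destination_cidr_block = "0.0.0.0/0"

  # prod: AZ i uses NAT i. dev/stage: nat_gateway_count is 1, so every AZ uses NAT 0.
  nat_gateway_id = aws_nat_gateway.main[count.index % local.nat_gateway_count].id
}

resource "aws_route_table_association" "private" {
  count          = var.az_count
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private[count.index].id
}
```

The public subnets additionally need an Internet Gateway and a public route table, which are not shown here.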

Implementing Across Environments

Once your Reusable VPC Module is polished, utilizing it across environments becomes a trivial exercise in configuration management. I recommend a directory-based structure over Terraform Workspaces for clearer isolation of state files and variable definitions.

Directory Structure

infrastructure/
├── modules/
│   └── vpc/ (The code we wrote above)
├── environments/
│   ├── dev/
│   │   └── main.tf
│   ├── stage/
│   │   └── main.tf
│   └── prod/
│       └── main.tf

The Implementation (DRY at work)

In environments/prod/main.tf, your code is now incredibly concise:

module "vpc" {
  source      = "../../modules/vpc"
  
  environment = "prod"
  vpc_cidr    = "10.0.0.0/16"
  az_count    = 3 # High Availability
}

Contrast this with environments/dev/main.tf:

module "vpc" {
  source      = "../../modules/vpc"
  
  environment = "dev"
  vpc_cidr    = "10.10.0.0/16" # Different CIDR
  az_count    = 2 # Lower cost
}

Advanced Patterns & Considerations

Tagging Standards

Effective tagging is non-negotiable for cost allocation and resource tracking. Use the default_tags feature in the AWS provider configuration to apply global tags, but ensure your module accepts a tags map variable to merge specific metadata.
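
As a sketch of both halves, assuming a placeholder region and tag values: global tags are set once on the provider, and the module merges a caller-supplied map with its own Name tag.

```hcl
# Root configuration: tags applied to every taggable resource
provider "aws" {
  region = "us-east-1" # placeholder region

  default_tags {
    tags = {
      Environment = "prod"
      ManagedBy   = "terraform"
    }
  }
}

# modules/vpc/variables.tf
variable "tags" {
  description = "Extra tags merged onto resources created by this module"
  type        = map(string)
  default     = {}
}

# modules/vpc/main.tf: later arguments to merge() win on key conflicts
locals {
  vpc_tags = merge(var.tags, {
    Name = "${var.environment}-vpc"
  })
}
```

The VPC resource would then use tags = local.vpc_tags in place of its inline map.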

Outputting Values for Dependency Injection

Your VPC module is likely the foundation for other modules (like EKS or RDS). Ensure you output the IDs required by these dependent resources.

# modules/vpc/outputs.tf

output "vpc_id" {
  value = aws_vpc.main.id
}

output "private_subnet_ids" {
  value = aws_subnet.private[*].id
}

output "public_subnet_ids" {
  value = aws_subnet.public[*].id
}

Frequently Asked Questions (FAQ)

Should I use the community `terraform-aws-modules/vpc/aws` module or build my own?

For beginners or rapid prototyping, the community module is excellent. However, for Expert SRE teams, building your own Reusable VPC Module is often preferred. It reduces “bloat” (unused features from the community module) and allows strict adherence to internal naming conventions and security compliance logic that a generic module cannot provide.

How do I handle VPC Peering between these environments?

Generally, you should avoid peering Dev and Prod. However, if you need shared services (like a tooling VPC), create a separate vpc-peering module. Do not bake peering logic into the core VPC module, as it creates circular dependencies and makes the module difficult to destroy.

What about VPC Flow Logs?

Flow Logs should be a standard part of your reusable module. I recommend adding a variable enable_flow_logs (defaulting to true) and storing logs in S3 or CloudWatch Logs. This ensures that every environment spun up with your module has auditing enabled by default.
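
A minimal sketch of that toggle, assuming logs are delivered to an existing S3 bucket whose ARN is passed in as the hypothetical var.flow_logs_bucket_arn:

```hcl
variable "enable_flow_logs" {
  description = "Create VPC Flow Logs for this VPC"
  type        = bool
  default     = true
}

variable "flow_logs_bucket_arn" {
  description = "ARN of the S3 bucket that receives flow logs"
  type        = string
  default     = null
}

resource "aws_flow_log" "vpc" {
  # Created unless the caller explicitly opts out
  count = var.enable_flow_logs ? 1 : 0

  vpc_id               = aws_vpc.main.id
  traffic_type         = "ALL"
  log_destination_type = "s3"
  log_destination      = var.flow_logs_bucket_arn
}
```

S3 delivery needs no IAM role on the flow log itself. Sending to CloudWatch Logs instead requires an iam_role_arn and a log group.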

Conclusion

Transitioning to Reusable VPC Modules transforms your infrastructure from a collection of static scripts into a dynamic, versioned product. By abstracting the complexity of subnet math and resource allocation, you empower your team to deploy Dev, Stage, and Prod environments that are consistent, compliant, and cost-optimized.

Start refactoring your hardcoded network configurations today. Isolate your logic into a module, version it, and watch your drift disappear. Thank you for reading the DevopsRoles page!

Rapid Prototyping GCP: Terraform, GitHub, Docker & Streamlit in GCP

11/19/2025 HuuPV

In my experience as a Senior Staff DevOps Engineer, I’ve often seen deployment friction halt brilliant ideas at the proof-of-concept stage. When the primary goal is validating a data product or ML model, speed is the most critical metric. This guide offers an expert-level strategy for achieving true Rapid Prototyping in GCP by integrating an elite toolset: Terraform for infrastructure-as-code, GitHub Actions for CI/CD, Docker for containerization, and Streamlit for the frontend application layer.

We’ll architect a highly automated, cost-optimized pipeline that enables a single developer to push a change to a Git branch and have a fully deployed, tested prototype running on Google Cloud Platform (GCP) minutes later. This methodology transforms your development lifecycle from weeks to hours.

The Foundational Stack for Rapid Prototyping in GCP

To truly master **Rapid Prototyping in GCP**, we must establish a robust, yet flexible, technology stack. Our chosen components prioritize automation, reproducibility, and minimal operational overhead:

Infrastructure: Terraform – Define all GCP resources (VPC, Cloud Run, Artifact Registry) declaratively. This ensures the environment is reproducible and easily torn down after validation.
Application Framework: Streamlit – Allows data scientists and ML engineers to create complex, interactive web applications using only Python, eliminating frontend complexity.
Containerization: Docker – Standardizes the application environment, bundling all dependencies (Python versions, libraries) and ensuring the prototype runs identically from local machine to GCP.
CI/CD & Source Control: GitHub & GitHub Actions – Provides the automated workflow for testing, building the Docker image, pushing it to Artifact Registry, and deploying the application to Cloud Run.

Pro-Tip: Choosing the GCP Target
For rapid prototyping of web-facing applications, **Google Cloud Run** is the superior choice over GKE or Compute Engine. It offers serverless container execution, scales down to zero (minimizing cost), and integrates seamlessly with container images from Artifact Registry.

Step 1: Defining Infrastructure with Terraform

Our infrastructure definition must be minimal but secure. The snippet below enables the necessary APIs in an existing project and defines our key deployment targets: an **Artifact Registry** repository and the **Cloud Run** service itself. The service will be made public for easy prototype sharing.

Required Terraform Code (main.tf Snippet):


resource "google_project_service" "apis" {
  for_each = toset([
    "cloudresourcemanager.googleapis.com",
    "run.googleapis.com",
    "artifactregistry.googleapis.com",
    "iam.googleapis.com"
  ])
  project = var.project_id
  service = each.key
  disable_on_destroy = false
}

resource "google_artifact_registry_repository" "repo" {
  location = var.region
  repository_id = var.repo_name
  format = "DOCKER"
}

resource "google_cloud_run_v2_service" "prototype_app" {
  name = var.service_name
  location = var.region

  template {
    containers {
      image = "${var.region}-docker.pkg.dev/${var.project_id}/${var.repo_name}/${var.image_name}:latest"
      resources {
        cpu_idle = true
        limits = {
          memory = "1Gi"
        }
      }
    }
  }

  traffic {
    type = "TRAFFIC_TARGET_ALLOCATION_TYPE_LATEST"
    percent = 100
  }

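  // Accept traffic from any network source. Ingress only controls network
  // reachability; unauthenticated access additionally requires granting
  // roles/run.invoker to allUsers on the service.
  // See: https://cloud.google.com/run/docs/authenticating/public
  ingress = "INGRESS_TRAFFIC_ALL"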
}

This code block uses the `latest` tag for true rapid iteration, though for production, a commit SHA tag is preferred. By keeping the service public, we streamline the sharing process, a critical part of **Rapid Prototyping GCP** solutions.
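
Keeping the service public takes an IAM binding in addition to the service definition. A minimal sketch, assuming the service resource is named prototype_app as in the snippet above; the resource label public_invoker is a name chosen for this example.

```hcl
# Grant the Cloud Run Invoker role to everyone, including unauthenticated callers.
# Remove this resource to require authentication again.
resource "google_cloud_run_v2_service_iam_member" "public_invoker" {
  project  = google_cloud_run_v2_service.prototype_app.project
  location = google_cloud_run_v2_service.prototype_app.location
  name     = google_cloud_run_v2_service.prototype_app.name
  role     = "roles/run.invoker"
  member   = "allUsers"
}
```

Keeping the binding in its own resource makes the switch from public prototype to authenticated service a one-resource change.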

Step 2: Containerizing the Streamlit Application with Docker

The Streamlit application requires a minimal, multi-stage Dockerfile to keep image size small and build times fast.

Dockerfile Example:


# Stage 1: Builder
FROM python:3.10-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Stage 2: Production
FROM python:3.10-slim
WORKDIR /app
COPY --from=builder /usr/local/lib/python3.10/site-packages/ /usr/local/lib/python3.10/site-packages/
COPY --from=builder /usr/local/bin/ /usr/local/bin/
COPY . .

# Streamlit defaults to port 8501; we override it to 8080 below for Cloud Run
EXPOSE 8080

# The command to run the application
CMD ["streamlit", "run", "app.py", "--server.port=8080", "--server.enableCORS=false"]

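Note: We explicitly set the Streamlit port to **8080** via the `CMD` instruction. This is the default port on which Cloud Run sends requests to the container; Cloud Run also exposes it through the `PORT` environment variable, and it can be changed in the service configuration.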

Step 3: Implementing CI/CD with GitHub Actions

The core of our **Rapid Prototyping GCP** pipeline is the CI/CD workflow, automated via GitHub Actions. A push to the `main` branch should trigger a container build, push, and deployment.

GitHub Actions Workflow (.github/workflows/deploy.yml):


name: Build and Deploy Prototype to Cloud Run

on:
  push:
    branches:
      - main
  workflow_dispatch:

env:
  PROJECT_ID: ${{ secrets.GCP_PROJECT_ID }}
  GCP_REGION: us-central1
  SERVICE_NAME: streamlit-prototype
  REPO_NAME: prototype-repo
  IMAGE_NAME: streamlit-app

jobs:
  deploy:
    runs-on: ubuntu-latest
    
    permissions:
      contents: 'read'
      id-token: 'write' # Required for OIDC authentication

    steps:
    - name: Checkout Code
      uses: actions/checkout@v4

    - id: 'auth'
      name: 'Authenticate to GCP'
      uses: 'google-github-actions/auth@v2'
      with:
        workload_identity_provider: ${{ secrets.WIF_PROVIDER }}
        service_account: ${{ secrets.SA_EMAIL }}

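    - name: Configure Docker for Artifact Registry
      run: gcloud auth configure-docker ${{ env.GCP_REGION }}-docker.pkg.dev --quiet

    - name: Set up Docker
      uses: docker/setup-buildx-action@v3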

    - name: Build and Push Docker Image
      uses: docker/build-push-action@v5
      with:
        push: true
        tags: ${{ env.GCP_REGION }}-docker.pkg.dev/${{ env.PROJECT_ID }}/${{ env.REPO_NAME }}/${{ env.IMAGE_NAME }}:latest
        context: .
        
    - name: Deploy to Cloud Run
      uses: google-github-actions/deploy-cloudrun@v2
      with:
        service: ${{ env.SERVICE_NAME }}
        region: ${{ env.GCP_REGION }}
        image: ${{ env.GCP_REGION }}-docker.pkg.dev/${{ env.PROJECT_ID }}/${{ env.REPO_NAME }}/${{ env.IMAGE_NAME }}:latest

Advanced Concept: GitHub OIDC Integration
We use **Workload Identity Federation (WIF)**, not static service account keys, for secure authentication. The `id-token: 'write'` permission lets the job request a short-lived OIDC token from GitHub, which the auth action exchanges for short-lived GCP credentials, so no long-lived key is stored in the repository. Refer to the official GCP IAM documentation for setting up the required WIF pool and provider.
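
The pool and provider can themselves be managed in Terraform. The following is a sketch under stated assumptions, not a hardened setup: the pool ID, provider ID, service account ID, and the repository my-org/my-repo are placeholders invented for this example and must be replaced with your own values.

```hcl
# Service account that GitHub Actions will impersonate (hypothetical ID).
resource "google_service_account" "deployer" {
  account_id   = "github-deployer"
  display_name = "GitHub Actions deployer"
}

resource "google_iam_workload_identity_pool" "github" {
  workload_identity_pool_id = "github-pool"
}

resource "google_iam_workload_identity_pool_provider" "github" {
  workload_identity_pool_id          = google_iam_workload_identity_pool.github.workload_identity_pool_id
  workload_identity_pool_provider_id = "github-provider"

  # Map claims from GitHub's OIDC token onto Google attributes.
  attribute_mapping = {
    "google.subject"       = "assertion.sub"
    "attribute.repository" = "assertion.repository"
  }

  # Only accept tokens issued for this repository (placeholder value).
  attribute_condition = "assertion.repository == \"my-org/my-repo\""

  oidc {
    issuer_uri = "https://token.actions.githubusercontent.com"
  }
}

# Allow workflows from that repository to impersonate the service account.
resource "google_service_account_iam_member" "github_wif" {
  service_account_id = google_service_account.deployer.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "principalSet://iam.googleapis.com/${google_iam_workload_identity_pool.github.name}/attribute.repository/my-org/my-repo"
}
```

The provider's full resource name and the service account email are the values the workflow expects in the WIF_PROVIDER and SA_EMAIL secrets. The service account still needs its own roles for pushing images and deploying to Cloud Run.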

Best Practices for Iterative Development and Cost Control

A successful **Rapid Prototyping GCP** pipeline isn’t just about deployment; it’s about making iteration cheap and fast, and managing the associated cloud costs.

Rapid Iteration with Streamlit’s Application State

Leverage Streamlit’s native caching mechanisms (e.g., `@st.cache_data`, `@st.cache_resource`) and session state (`st.session_state`) effectively. This prevents re-running expensive computations (like model loading or large data fetches) on every user interaction, reducing application latency and improving the perceived speed of the prototype.

Cost Management with Cloud Run

Scale-to-Zero: Ensure your Cloud Run service is configured with 0 minimum instances (`--min-instances=0` in gcloud, which is also the default). This is crucial. If the prototype isn’t being actively viewed, you pay nothing for compute time.
Resource Limits: Start with a small CPU/memory allocation (e.g., 1 vCPU, 512 MiB) and increase only if necessary. Prototypes should be cost-aware.
Teardown and Re-creation: For temporary projects, use `terraform destroy` when validation is complete. For environments that must persist, you can delete just the service and re-create it later with `terraform apply`. To force a replacement in a single step, use `terraform apply -replace=google_cloud_run_v2_service.prototype_app`, which supersedes the deprecated `terraform taint` command.
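
In Terraform, the first two points correspond to the scaling and resources blocks of the service. A sketch that reuses the variable names from Step 1; the maximum of 2 instances is an arbitrary cap chosen for this example.

```hcl
resource "google_cloud_run_v2_service" "prototype_app" {
  name     = var.service_name
  location = var.region

  template {
    scaling {
      min_instance_count = 0 # scale to zero when idle
      max_instance_count = 2 # hypothetical cap to bound spend
    }

    containers {
      image = "${var.region}-docker.pkg.dev/${var.project_id}/${var.repo_name}/${var.image_name}:latest"

      resources {
        cpu_idle = true # only allocate CPU while a request is being processed
        limits = {
          cpu    = "1"
          memory = "512Mi"
        }
      }
    }
  }
}
```

Raise the memory limit first if the container is killed for exceeding it; Streamlit apps that load models often need more than 512 MiB.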

Frequently Asked Questions (FAQ)

How is this Rapid Prototyping stack different from using App Engine or GKE?

The key difference is **operational overhead and cost**. App Engine (Standard) is limited by language runtimes, and GKE (Kubernetes) introduces significant complexity (managing nodes, deployments, services, ingress) that is unnecessary for a rapid proof-of-concept. Cloud Run is a fully managed container platform that handles autoscaling, patching, and networking, allowing you to focus purely on the application logic for your prototype.

What are the security implications of making the Cloud Run service unauthenticated?

Making the service public (the `--allow-unauthenticated` flag in gcloud, or an `allUsers` invoker binding in Terraform) is acceptable for internal or temporary prototypes, as it simplifies sharing. For prototypes that handle sensitive data or move toward production, you must update the Terraform configuration to remove the public access IAM policy and enforce authentication (e.g., using IAP or requiring a valid GCP identity token).

Can I use Cloud Build instead of GitHub Actions for this CI/CD?

Absolutely. Cloud Build is GCP’s native CI/CD platform and can be a faster alternative, especially for image builds that stay within the Google Cloud network. The GitHub Actions approach was chosen here for its seamless integration with the source control repository (GitHub) and its broad community support, simplifying the adoption for teams already using GitHub.

Conclusion

Building a modern **Rapid Prototyping GCP** pipeline requires a holistic view of the entire software lifecycle. By coupling the declarative power of **Terraform** with the automation of **GitHub Actions** and the serverless execution of **Cloud Run**, you gain an unparalleled ability to quickly validate ideas. This blueprint empowers expert DevOps teams and SREs to dramatically reduce the time-to-market for data applications and machine learning models, moving from concept to deployed, interactive prototype in minutes, not days. Thank you for reading the DevopsRoles page!

How to deploy an EKS cluster using Terraform

10/30/2025 HuuPV

Welcome to the definitive guide on using Terraform to provision and manage Amazon Elastic Kubernetes Service (EKS). Manually setting up a Kubernetes cluster on AWS involves navigating a complex web of resources: VPCs, subnets, IAM roles, security groups, and the EKS control plane itself. This process is not only time-consuming but also prone to human error and difficult to reproduce.

This is where Terraform, the industry-standard Infrastructure as Code (IaC) tool, becomes indispensable. By defining your infrastructure in declarative configuration files, you can automate the entire provisioning process, ensuring consistency, repeatability, and version control. In this comprehensive tutorial, we will walk you through every step required to deploy an EKS cluster using Terraform, from setting up the networking to configuring node groups and connecting with kubectl. This guide is designed for DevOps engineers, SREs, and system administrators looking to adopt best practices for their Kubernetes cluster on AWS.

Why Use Terraform to Deploy an EKS Cluster?

While the AWS Management Console or AWS CLI are valid ways to start, any production-grade system benefits immensely from an IaC approach. When you deploy an EKS cluster, you’re not just creating one resource; you’re orchestrating dozens of interconnected components. Terraform excels at managing this complexity.

The Power of Infrastructure as Code (IaC)

Infrastructure as Code (IaC) is the practice of managing and provisioning infrastructure through machine-readable definition files, rather than physical hardware configuration or interactive configuration tools. Terraform allows you to write, plan, and create your AWS EKS cluster setup with code. This code can be versioned in Git, peer-reviewed, and tested, just like your application code.

Repeatability and Consistency

Need to spin up an identical cluster for staging, development, or a different region? With a manual process, this is a nightmare of forgotten settings and configuration drift. With Terraform, you simply run terraform apply. Your configuration files are the single source of truth, guaranteeing that every environment is a precise, consistent replica.

State Management and Version Control

Terraform creates a state file that maps your configuration to the real-world resources it has created. This state allows Terraform to plan changes, understand dependencies, and manage the entire lifecycle of your infrastructure. When you need to upgrade your EKS version or change a node’s instance type, Terraform calculates the exact changes needed and executes them in the correct order. You can destroy the entire stack with a single terraform destroy command, ensuring no orphaned resources are left behind.

Prerequisites: What You Need Before You Start

Before we begin, ensure you have the following tools and accounts set up. This guide assumes you are comfortable working from the command line.

An AWS Account: You will need an AWS account with IAM permissions to create EKS clusters, VPCs, IAM roles, and associated resources.
AWS CLI: The AWS Command Line Interface, configured with your credentials (e.g., via aws configure).
Terraform: Terraform 1.5 or later (the configuration below sets required_version = "~> 1.5") installed on your local machine.
kubectl: The Kubernetes command-line tool. This is used to interact with your cluster once it’s created.
aws-iam-authenticator (Optional): This helper binary allows kubectl to use AWS IAM credentials for authentication. It is not needed with AWS CLI 1.16.156 or later, because the kubeconfig written by the aws eks update-kubeconfig command, which we will use, calls aws eks get-token instead.

Step-by-Step Guide: Provisioning Your EKS Infrastructure

We will build our configuration using the widely used, battle-tested terraform-aws-modules EKS module, which is maintained by the community. This module abstracts away immense complexity and encapsulates best practices for EKS cluster provisioning.

Step 1: Setting Up Your Terraform Project

First, create a new directory for your project. Inside this directory, we’ll create several .tf files to keep our configuration organized.

Your directory structure will look like this:


.
├── main.tf
├── variables.tf
└── outputs.tf

Let’s start with main.tf. This file will contain our provider configuration and the module calls.


# main.tf

terraform {
  required_version = "~> 1.5"

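  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    # Needed for the random_pet resource below
    random = {
      source  = "hashicorp/random"
      version = "~> 3.0"
    }
  }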
}

provider "aws" {
  region = var.aws_region
}

# Define a random string to ensure unique EKS cluster names
resource "random_pet" "cluster_name_suffix" {
  length = 2
}

Next, define your variables in variables.tf. This allows you to easily customize your deployment without changing the core logic.


# variables.tf

variable "aws_region" {
  description = "The AWS region to deploy resources in."
  type        = string
  default     = "us-east-1"
}

variable "cluster_name" {
  description = "The name for your EKS cluster."
  type        = string
  default     = "my-demo-cluster"
}

variable "cluster_version" {
  description = "The Kubernetes version for the EKS cluster."
  type        = string
  default     = "1.29"
}

variable "vpc_cidr" {
  description = "The CIDR block for the EKS cluster VPC."
  type        = string
  default     = "10.0.0.0/16"
}

variable "azs" {
  description = "Availability Zones for the VPC and EKS."
  type        = list(string)
  default     = ["us-east-1a", "us-east-1b", "us-east-1c"]
}

Step 2: Defining the Networking (VPC)

An EKS cluster requires a robust, highly available Virtual Private Cloud (VPC) with both public and private subnets across multiple Availability Zones. We will use the companion terraform-aws-modules VPC module to handle this.

Add the following to your main.tf file:


# main.tf (continued...)

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.5.3"

  name = "${var.cluster_name}-vpc"
  cidr = var.vpc_cidr

  azs             = var.azs
  private_subnets = [for k, v in var.azs : cidrsubnet(var.vpc_cidr, 8, k + 4)]
  public_subnets  = [for k, v in var.azs : cidrsubnet(var.vpc_cidr, 8, k)]

  enable_nat_gateway   = true
  single_nat_gateway   = true
  enable_dns_hostnames = true

  # Tags required by EKS
  public_subnet_tags = {
    "kubernetes.io/cluster/${var.cluster_name}-${random_pet.cluster_name_suffix.id}" = "shared"
    "kubernetes.io/role/elb"                                                         = "1"
  }

  private_subnet_tags = {
    "kubernetes.io/cluster/${var.cluster_name}-${random_pet.cluster_name_suffix.id}" = "shared"
    "kubernetes.io/role/internal-elb"                                                = "1"
  }
}

This block provisions a new VPC with public subnets (for load balancers) and private subnets (for worker nodes) across the three AZs we defined. Crucially, it adds the specific tags that EKS requires to identify which subnets it can use for internal and external load balancers.
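
To make the subnet math concrete, the standalone sketch below evaluates the same cidrsubnet expressions with the default values from variables.tf. Adding 8 bits to the /16 prefix yields /24 networks, and the third argument selects which one. The locals and outputs exist only for this illustration.

```hcl
locals {
  vpc_cidr = "10.0.0.0/16"
  azs      = ["us-east-1a", "us-east-1b", "us-east-1c"]

  # k is the list index (0, 1, 2), so public subnets use network numbers 0-2.
  public_subnets = [for k, v in local.azs : cidrsubnet(local.vpc_cidr, 8, k)]

  # The offset of 4 moves private subnets to network numbers 4-6.
  private_subnets = [for k, v in local.azs : cidrsubnet(local.vpc_cidr, 8, k + 4)]
}

output "public_subnets" {
  value = local.public_subnets # ["10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24"]
}

output "private_subnets" {
  value = local.private_subnets # ["10.0.4.0/24", "10.0.5.0/24", "10.0.6.0/24"]
}
```

The offset leaves 10.0.3.0/24 unused between the two ranges, so a fourth Availability Zone can be added without the public and private ranges colliding.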

Step 3: Defining the EKS Cluster with the Official Module

Now for the main event. We will add the terraform-aws-modules/eks/aws module. This single module will create:

The EKS Control Plane
The necessary IAM Roles (for the cluster and nodes)
Security Groups
Managed Node Groups

Add this final block to your main.tf:


# main.tf (continued...)

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "20.8.4"

  cluster_name    = "${var.cluster_name}-${random_pet.cluster_name_suffix.id}"
  cluster_version = var.cluster_version

  cluster_endpoint_public_access = true

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  # EKS Managed Node Group configuration
  eks_managed_node_groups = {
    general_purpose = {
      name           = "general-purpose-nodes"
      instance_types = ["t3.medium"]

      min_size     = 1
      max_size     = 3
      desired_size = 2

      # Use the private subnets
      subnet_ids = module.vpc.private_subnets

      tags = {
        Purpose = "general-purpose-workloads"
      }
    }
  }

  # Grant the IAM identity (user or role) that runs Terraform
  # cluster-admin access through an EKS access entry, so our local
  # kubectl can authenticate. Version 20 of the module replaced the
  # aws_auth_* arguments (aws-auth ConfigMap) with access entries.
  enable_cluster_creator_admin_permissions = true
}

This configuration defines an EKS cluster and a managed node group named general_purpose. This node group will run t3.medium instances, starting with 2 nodes and bounded between 1 and 3; the node count only changes within those bounds once something such as the Cluster Autoscaler adjusts the desired size. The enable_cluster_creator_admin_permissions setting is critical: it takes the IAM identity (user or role) that is running Terraform and grants it cluster administrator permissions within the new Kubernetes cluster.

Step 4: Defining Outputs

Finally, we need to output the cluster’s information so we can connect to it. Create an outputs.tf file.


# outputs.tf

output "cluster_name" {
  description = "The name of the EKS cluster."
  value       = module.eks.cluster_name
}

output "cluster_endpoint" {
  description = "The endpoint for your EKS cluster."
  value       = module.eks.cluster_endpoint
}

output "cluster_ca_certificate" {
  description = "Base64 encoded certificate data for cluster."
  value       = module.eks.cluster_certificate_authority_data
}

output "configure_kubectl_command" {
  description = "Command to configure kubectl to connect to the cluster."
  value       = "aws eks update-kubeconfig --region ${var.aws_region} --name ${module.eks.cluster_name}"
}

Step 5: Deploying and Connecting to Your Cluster

With all our configuration files in place, it’s time to deploy.

Initialize and Apply

Run the following commands in your terminal:


# 1. Initialize the Terraform project
# This downloads the AWS provider and the EKS/VPC modules
terraform init

# 2. Plan the deployment
# This shows you all the resources Terraform will create
terraform plan

# 3. Apply the configuration
# This will build the VPC, IAM roles, and EKS cluster.
# It can take 15-20 minutes for the EKS cluster to become active.
terraform apply --auto-approve

After terraform apply completes, it will print the values from your outputs.tf file.

Configuring `kubectl`

The easiest way to configure your local kubectl is to use the command we generated in our outputs. Copy the value of configure_kubectl_command from the Terraform output and paste it into your terminal.


# This command will be printed by 'terraform apply'
aws eks update-kubeconfig --region us-east-1 --name my-demo-cluster-xy

This AWS CLI command automatically updates your local ~/.kube/config file with the new cluster’s credentials and endpoint.

Verifying the Cluster

You can now use kubectl to interact with your cluster. Let’s check the status of our nodes:


kubectl get nodes

# You should see an output similar to this, showing your 2 nodes are 'Ready':
# NAME                          STATUS   ROLES    AGE   VERSION
# ip-10-0-10-123.ec2.internal   Ready    <none>   5m    v1.29.0-eks
# ip-10-0-11-45.ec2.internal    Ready    <none>   5m    v1.29.0-eks

You can also check the pods running in the kube-system namespace:


kubectl get pods -n kube-system

# You will see core components like coredns and the aws-node (VPC CNI) pods.

Congratulations! You have successfully deployed a working EKS cluster using Terraform.

Advanced Considerations and Best Practices

This guide provides a strong foundation, but a true production environment has more components. Here are key areas to explore next:

IAM Roles for Service Accounts (IRSA): Instead of giving broad permissions to worker nodes, use IRSA to assign fine-grained IAM roles directly to your Kubernetes service accounts. This is the most secure way for your pods (e.g., external-dns, aws-load-balancer-controller) to interact with AWS APIs. The Terraform EKS module has built-in support for this.
Cluster Autoscaling: The Managed Node Group defines minimum and maximum sizes, but nothing adjusts the node count automatically on its own. To scale based on pending pods and their resource requests, deploy the Kubernetes Cluster Autoscaler (or a similar tool such as Karpenter).
EKS Add-ons: EKS manages core add-ons like vpc-cni, kube-proxy, and coredns. You can manage the versions of these add-ons directly within the Terraform EKS module block, treating them as code as well.
Logging and Monitoring: Configure EKS control plane logging (api, audit, authenticator) and ship those logs to CloudWatch. Use the EKS module to enable these logs and deploy monitoring solutions like Prometheus and Grafana.
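
Several of these items map to arguments of the EKS module itself. The fragment below is a sketch of settings to add inside the existing module "eks" block from Step 3, assuming the 20.x line of the module used in this guide. To my knowledge the log types and IRSA shown here already match that version's defaults, so they are written out only to make the configuration explicit.

```hcl
module "eks" {
  # Keep the source, version, and other arguments from Step 3 here.

  # Ship control plane logs to CloudWatch Logs.
  cluster_enabled_log_types = ["api", "audit", "authenticator"]

  # Manage the core add-ons as code; most_recent picks the latest
  # add-on version compatible with the cluster version.
  cluster_addons = {
    coredns    = { most_recent = true }
    kube-proxy = { most_recent = true }
    vpc-cni    = { most_recent = true }
  }

  # Create the OIDC provider that IAM Roles for Service Accounts relies on.
  enable_irsa = true
}
```

For stricter change control, pin each add-on with addon_version instead of most_recent so upgrades only happen through a reviewed change.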

Frequently Asked Questions

Can I use this guide to deploy an EKS cluster into an existing VPC?

Yes. Remove the module "vpc" block and pass your existing VPC and subnet IDs directly to the vpc_id and subnet_ids arguments of the module "eks" block. You must ensure your subnets are tagged correctly as described in Step 2.

How do I upgrade my EKS cluster’s Kubernetes version using Terraform?

Update the cluster_version default in your variables.tf (e.g., from “1.29” to “1.30”) and run terraform apply to upgrade the control plane. EKS only supports upgrading one minor version at a time. Your node groups must be upgraded as well: with this module, managed node groups typically follow the cluster version unless you pin a different one, so review the plan to see when the nodes will be replaced.

What is the difference between Managed Node Groups and Fargate?

Managed Node Groups (used in this guide) provision EC2 instances in your account, with AWS automating their provisioning and rolling updates. You have full control over the instance type and operating system. AWS Fargate is a serverless compute engine that allows you to run pods without managing any underlying EC2 instances at all. The EKS module also supports creating Fargate profiles.
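
For the existing-VPC question above, a minimal sketch of the change might look like this. The two input variables are hypothetical names introduced for the example; cluster_name and cluster_version come from variables.tf, and the remaining module arguments from Step 3 are left out for brevity.

```hcl
variable "existing_vpc_id" {
  description = "ID of the VPC to deploy the cluster into (hypothetical variable)."
  type        = string
}

variable "existing_private_subnet_ids" {
  description = "IDs of already-tagged private subnets in at least two AZs (hypothetical variable)."
  type        = list(string)
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "20.8.4"

  cluster_name    = var.cluster_name
  cluster_version = var.cluster_version

  # Existing network instead of module.vpc outputs
  vpc_id     = var.existing_vpc_id
  subnet_ids = var.existing_private_subnet_ids
}
```

Since this variant drops the random suffix, the kubernetes.io/cluster/ tag on the existing subnets has to use the exact value of cluster_name.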

Conclusion

You have now mastered the fundamentals of EKS cluster provisioning with Infrastructure as Code. By leveraging the official Terraform EKS module, you’ve abstracted away massive complexity and built a scalable, repeatable, and maintainable foundation for your Kubernetes workloads on AWS. This declarative approach is the cornerstone of modern DevOps and SRE practices, enabling you to manage infrastructure with the same rigor and confidence as application code.

By following this guide, you’ve learned not just how to deploy EKS cluster infrastructure, but how to do it in a way that is robust, scalable, and manageable. From here, you are well-equipped to explore advanced topics like IRSA, cluster autoscaling, and CI/CD pipelines for your new Kubernetes cluster. Thank you for reading the DevopsRoles page!
