Infrastructure as Code (IaC) is at the heart of modern SRE and DevOps practices. As an SRE, using Terraform empowers you to provision, manage, and scale infrastructure predictably across cloud environments. In this guide, I walk through the key reasons and examples showing why and how I use Terraform, particularly for multi-region and multi-cloud deployments, provisioning automation, and secure secret management.
Why Use Terraform?
Terraform is a declarative IaC tool that helps you manage infrastructure consistently, reliably, and at scale. Here are a few key benefits:
- Cloud Agnostic: Supports AWS, Azure, GCP, and many others.
- Multi-region Deployment: Define and manage resources across regions from a single codebase.
- Modular and Reusable: Write once, reuse with variables and modules.
- Version Controlled: Keep infrastructure definitions in Git to track changes and enable collaboration.
- Automation-Ready: Integrates seamlessly with CI/CD pipelines and tools like GitHub Actions.
Multi-Region and Multi-Cloud Deployments
In a globally distributed system, SREs often need to deploy services across multiple regions or even different cloud providers. Terraform makes this easy by allowing the use of multiple provider blocks. This approach enables:
- High availability and disaster recovery
- Latency optimization by placing resources closer to users
- Vendor independence and resiliency across cloud providers
This addresses typical SRE goals such as minimizing downtime, reducing failure domains, and simplifying cross-cloud scaling. Using Terraform, you can define multiple provider blocks with aliases to manage different regions or cloud providers:
Multi-Region Example (AWS)
provider "aws" {
alias = "us-east-1"
region = "us-east-1"
}
provider "aws" {
alias = "us-west-2"
region = "us-west-2"
}
resource "aws_instance" "east" {
ami = "ami-0123456789abcdef0"
instance_type = "t2.micro"
provider = aws.us-east-1
}
resource "aws_instance" "west" {
ami = "ami-0123456789abcdef0"
instance_type = "t2.micro"
provider = aws.us-west-2
}
Multi-Cloud Example (AWS + Azure)
provider "aws" {
region = "us-east-1"
}
provider "azurerm" {
features = {}
subscription_id = "<subscription_id>"
client_id = "<client_id>"
client_secret = "<client_secret>"
tenant_id = "<tenant_id>"
}
resource "aws_instance" "example" {
ami = "ami-0123456789abcdef0"
instance_type = "t2.micro"
}
resource "azurerm_virtual_machine" "example" {
name = "example-vm"
location = "eastus"
size = "Standard_A1"
# other required fields
}
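Hardcoding Azure credentials is shown here only for illustration. The azurerm provider can also read them from environment variables, which keeps secrets out of version control:
# Set these in the shell instead of the provider block
export ARM_SUBSCRIPTION_ID="<subscription_id>"
export ARM_CLIENT_ID="<client_id>"
export ARM_CLIENT_SECRET="<client_secret>"
export ARM_TENANT_ID="<tenant_id>"
terraform plan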
Terraform Variables: Inputs, Outputs, and Tfvars
Variables in Terraform promote reusability, parameterization, and separation of configuration from logic. Instead of hardcoding values like instance types, region names, or AMI IDs, we can define them as variables and supply different values per environment, workspace, or CI/CD pipeline.
This makes it easier to:
- Deploy to multiple environments (e.g., dev, stage, prod)
- Reduce code duplication
- Handle sensitive values securely
- Support team collaboration and maintain clean, scalable infrastructure definitions
Input Variables
Variables can be defined in their own file (e.g., variables.tf) or directly in your main.tf; they are the parameters your module or configuration expects.
variable "instance_type" {
description = "EC2 instance type"
type = string
default = "t2.micro"
}
resource "aws_instance" "example_instance" {
ami = var.ami_id
instance_type = var.instance_type
}
You can override input variable values in several ways:
- From the CLI: terraform apply -var="instance_type=t3.micro"
- From a .tfvars file
- From environment variables prefixed with TF_VAR_ (see the example below)
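For example, the environment-variable form maps TF_VAR_instance_type onto var.instance_type automatically:
# Terraform reads this as var.instance_type
export TF_VAR_instance_type=t3.micro
terraform plan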
Output Variables
Output variables expose computed information after a Terraform run. They are useful for referencing values in other modules or for displaying information. For example, the following output prints the public IP address after an EC2 instance is created.
output "public_ip" {
  description = "Public IP address of the EC2 instance"
  value       = aws_instance.example_instance.public_ip
}
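You can also read outputs back from the state at any time after an apply:
terraform output public_ip   # print a single output value
terraform output -json       # print all outputs as JSON, handy in scripts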
Tfvars File
The default terraform.tfvars file supplies values for the input variables defined in your configuration:
cidr = "10.0.0.0/16"
instance_id = "ami-014e30c8a36252ae5"
instance_type = "t2.micro"
bucket_name = "terraform-s3-bucket"
region = "us-west-1"
availability_zones = ["us-west-1a", "us-west-1b"]
To use a different tfvars file during resource creation:
terraform apply -var-file=dev.tfvars
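A hypothetical dev.tfvars might override only the values that differ in development:
# dev.tfvars (illustrative values)
instance_type = "t3.micro"
region        = "us-west-2"
bucket_name   = "terraform-s3-bucket-dev"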
Using Modules for Reusability
Terraform modules are reusable containers for multiple resources used together. They help you:
- Encapsulate logic for deploying standardized components (e.g., EC2 instance, VPC)
- Promote DRY principles by avoiding duplicated code
- Standardize deployments across teams and environments
In SRE practice, this supports automation, consistency, and faster onboarding, while reducing the risk of human error when provisioning infrastructure repeatedly.
For example, in ./module/ec2_instance/main.tf, these blocks define a module that creates an EC2 instance:
variable "ami_value" {
description = "AMI ID for the EC2 instance"
type = string
}
variable "instance_type_value" {
description = "Type of the EC2 instance"
type = string
}
resource "aws_instance" "example" {
ami = var.ami_value
instance_type = var.instance_type_value
}
In a multi-user environment, when a team wants to create an EC2 instance, they simply write a main.tf that passes the desired values (i.e., ami_value and instance_type_value) to the module. This speeds up provisioning and reduces human error.
provider "aws" {
region = "us-west-1"
}
module "ec2_instance" {
source = "./module/ec2_instance"
ami_value = "ami-014e30c8a36252ae5"
instance_type_value = "t2.micro"
}
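Modules can also expose outputs to their callers. A minimal sketch, assuming an output named instance_id is added to the module:
# In module/ec2_instance/main.tf (assumed addition)
output "instance_id" {
  value = aws_instance.example.id
}
# In the caller's main.tf
output "ec2_id" {
  value = module.ec2_instance.instance_id
}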
Terraform State Management and Remote Backend
Terraform uses a state file to record the current state of your infrastructure. This file is essential because Terraform relies on it to determine what actions are required to bring the infrastructure in sync with your code. It keeps track of created resources, attributes, and dependencies.
By default, this state file is stored locally on your machine, which works fine for solo projects — but becomes problematic in multi-user or team environments.
Drawbacks of Local State in Team Settings
- No shared visibility — Only the person with the local file knows the current state.
- High risk of overwrites — Two users applying changes simultaneously may corrupt or overwrite each other’s work.
- No locking — There’s no mechanism to prevent multiple Terraform runs from happening at once.
- Configuration drift — Without a consistent, central state, environments can fall out of sync.
- Security risks — Local state may contain sensitive data (e.g. passwords, tokens) and is vulnerable to leaks if not properly secured.
Remote Backend: The Solution
To solve these issues, Terraform supports remote backends — storage systems where the state file is saved and accessed centrally. Popular options include:
- AWS S3 (with DynamoDB for locking)
- Terraform Cloud
- Azure Blob Storage
Using a remote backend:
- Provides a central source of truth
- Enables team collaboration
- Supports state locking and history tracking
- Helps enforce CI/CD workflows safely
What is Locking and Why Use DynamoDB?
When multiple people or automation pipelines interact with the same Terraform state, locking prevents conflicts. It ensures that only one process can update the state at a time.
Amazon DynamoDB is often used with S3 backends to manage this lock:
- Atomic operations to prevent race conditions
- Blocking and waiting if a lock is already held
- Scalable and highly available — no single point of failure
In SRE practices, this setup ensures reliable and concurrent-safe infrastructure changes — even with multiple developers or automation workflows.
Remote Backend with Locking Example
terraform {
  backend "s3" {
    bucket         = "wallace-s3-terraform-state-files"
    key            = "wallace/terraform.tfstate"
    region         = "us-west-1"
    dynamodb_table = "terraform-lock" # Enables locking
  }
}
This block tells Terraform to store the state file centrally in the S3 bucket and to manage the lock in DynamoDB.
The following is a sample block that provisions the DynamoDB table used for maintaining the lock:
resource "aws_dynamodb_table" "terraform_lock" {
name = "terraform-lock"
billing_mode = "PAY_PER_REQUEST"
hash_key = "LockID"
attribute {
name = "LockID"
type = "S"
}
}
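The state bucket itself can be managed with Terraform as well. A minimal sketch, reusing the bucket name from the backend block above, with versioning enabled so earlier state revisions remain recoverable (these bootstrap resources are usually created with local state first, because the backend requires them to exist before terraform init):
resource "aws_s3_bucket" "terraform_state" {
  bucket = "wallace-s3-terraform-state-files"
}
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration {
    status = "Enabled" # keeps a history of state file revisions
  }
}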
Terraform Provisioners
Provisioners in Terraform allow you to run scripts or commands after a resource is created. This is useful for bootstrapping, installing dependencies, or configuring systems on first boot.
In an SRE context, provisioners enable:
- Rapid test environment setup for integration/QA
- On-demand debugging or patching in dynamic environments
- Simplified deployment pipelines for quick validation of infrastructure-as-code
Use them cautiously — if a provisioner fails, Terraform may mark the resource as tainted.
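If a bootstrap step is non-critical, the on_failure argument lets the apply proceed without tainting the resource. A sketch, assuming a hypothetical warm-up script:
resource "aws_instance" "example" {
  # ... instance arguments ...
  provisioner "remote-exec" {
    inline     = ["/usr/local/bin/warmup.sh"] # hypothetical, non-critical step
    on_failure = continue # keep the resource even if this step fails
  }
}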
Full Provisioner-Based Infrastructure Example
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "6.8.0"
    }
  }
}
provider "aws" {
  region = "us-west-1"
}
variable "cidr" {
  default = "10.0.0.0/16"
}
resource "aws_key_pair" "example" {
  key_name   = "terraform-demo-wallace"
  public_key = file("~/.ssh/id_rsa.pub")
}
resource "aws_vpc" "main" {
  cidr_block = var.cidr
}
resource "aws_subnet" "sub1" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.0.0/24"
  availability_zone       = "us-west-1a"
  map_public_ip_on_launch = true
}
resource "aws_internet_gateway" "igw" {
  vpc_id = aws_vpc.main.id
}
resource "aws_route_table" "main" {
  vpc_id = aws_vpc.main.id
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.igw.id
  }
}
resource "aws_route_table_association" "rta1" {
  subnet_id      = aws_subnet.sub1.id
  route_table_id = aws_route_table.main.id
}
resource "aws_security_group" "web_sg" {
name = "web"
vpc_id = aws_vpc.main.id
ingress {
description = "HTTP from VPC"
from_port = 8000
to_port = 8000
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
ingress {
description = "SSH"
from_port = 22
to_port = 22
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
egress {
description = "Allow all outbound traffic"
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
resource "aws_instance" "web" {
ami = "ami-014e30c8a36252ae5"
instance_type = "t2.micro"
key_name = aws_key_pair.example.key_name
subnet_id = aws_subnet.sub1.id
vpc_security_group_ids = [aws_security_group.web_sg.id]
associate_public_ip_address = true
tags = {
Name = "WebServer"
}
connection {
type = "ssh"
user = "ubuntu"
private_key = file("~/.ssh/id_rsa")
host = self.public_ip
}
provisioner "file" {
source = "app.py"
destination = "/home/ubuntu/app.py"
}
provisioner "remote-exec" {
inline = [
"echo 'Hello from the remote instance'",
"sudo apt update -y",
"sudo apt-get install -y python3-venv",
"cd /home/ubuntu",
"python3 -m venv appenv",
"/home/ubuntu/appenv/bin/pip install --upgrade pip",
"/home/ubuntu/appenv/bin/pip install flask",
"chmod +x /home/ubuntu/app.py",
"/home/ubuntu/appenv/bin/python /home/ubuntu/app.py"
]
}
}
This example provisions an entire testing environment automatically:
- Creates a VPC, public subnet, internet gateway, and route table
- Sets up a security group that allows SSH and app traffic (port 8000)
- Provisions an EC2 instance using a key pair and public subnet
- Uses the file provisioner to copy a local app.py script to the instance
- Uses the remote-exec provisioner to install dependencies and start a Flask app
This automation enables rapid deployment of test or staging environments during development or CI/CD pipeline runs — a core goal of modern SRE teams.
GitHub Actions Integration Example
In a DevOps workflow, it’s common to automate infrastructure provisioning when code changes occur. Here’s how you can use GitHub Actions to automatically trigger Terraform when app.py is updated. (The workflow below assumes AWS credentials are available as repository secrets so Terraform can authenticate.)
Create a GitHub Actions workflow in .github/workflows/deploy.yml:
name: Auto Provision EC2 on App Change
on:
  push:
    paths:
      - app.py
jobs:
  deploy:
    runs-on: ubuntu-latest
    env:
      # Assumes AWS credentials are stored as GitHub repository secrets
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.5.0
      - name: Terraform Init
        run: terraform init
      - name: Terraform Plan
        run: terraform plan -out=tfplan
      - name: Terraform Apply
        run: terraform apply -auto-approve tfplan
      - name: Cleanup Plan File
        run: rm tfplan
This pipeline listens for changes to app.py and automatically runs terraform init, plan, and apply. It ensures that any changes to the application are immediately provisioned onto the EC2 instance via the configured provisioners.
Workspaces for Multi-Environment Support
Workspaces in Terraform allow you to maintain isolated state files for different environments (like dev, staging, prod) using the same configuration.
This solves several problems in SRE practice:
- Avoids overwriting infrastructure across environments
- Enables safe parallel deployments
- Simplifies CI/CD automation with clearly separated states
Workspace Commands
terraform workspace new dev
terraform workspace select dev
terraform workspace show
This example creates an AWS EC2 instance whose instance type depends on the current workspace, using a map variable and the lookup() function:
provider "aws" {
region = "us-west-1"
}
variable "ami_value" {
description = "AMI ID for the EC2 instance"
type = string
}
variable "instance_type_value" {
description = "Type of the EC2 instance"
type = map(string) #use map to allow different instance types for different environments
default = {
"dev" = "t2.micro"
"stage" = "t2.small"
"prod" = "t2.large"
}
}
module "ec2_instance" {
source = "./modules/ec2_instance"
ami_value = var.ami_value
instance_type_value = lookup(var.instance_type_value, terraform.workspace, "t2.micro") # Use lookup to get the instance type based on the workspace
}
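Pairing workspaces with per-environment tfvars files keeps everything in one codebase. For example, using the stage.tfvars file from the layout below:
terraform workspace select stage
terraform apply -var-file=stage.tfvars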
By using different workspaces for dev/stage/prod environments, the state files (terraform.tfstate) are maintained separately, as seen in the following folder structure:
wallacelee@imac % tree
.
├── main.tf
├── modules
│ └── ec2_instance
│ └── main.tf
├── stage.tfvars
├── terraform.tfstate.d
│ ├── dev
│ │ ├── terraform.tfstate
│ │ └── terraform.tfstate.backup
│ ├── prod
│ │ ├── terraform.tfstate
│ │ └── terraform.tfstate.backup
│ └── stage
│ ├── terraform.tfstate
│ └── terraform.tfstate.backup
└── terraform.tfvars
7 directories, 10 files
Secret Management with Vault + Terraform
HashiCorp Vault provides a secure way to store and access secrets. Terraform can authenticate to Vault using AppRole and retrieve secrets like API keys, passwords, or bucket names at runtime.
This addresses several critical SRE concerns:
- Avoid hardcoding sensitive data in Terraform code or state files
- Centralized secret lifecycle management
- Enforce access control and auditability for secrets usage
Integrating Vault with Terraform boosts your infrastructure security posture while maintaining automation.
Use Case: Securely retrieve a secret value (e.g., S3 bucket name) from Vault and use it in a resource:
resource "aws_s3_bucket" "example" {
bucket = data.vault_kv_secret_v2.example.data["s3-bucket-name"]
}
How to Set Up HashiCorp Vault:
- Install Vault on an EC2 instance.
sudo apt update && sudo apt install gpg
wget -O- https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
gpg --no-default-keyring --keyring /usr/share/keyrings/hashicorp-archive-keyring.gpg --fingerprint
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt update
sudo apt install vault
# Start Vault in dev mode on the EC2 instance (dev mode is for testing only):
vault server -dev -dev-listen-address="0.0.0.0:8200"
- Enable the KV (key-value) secrets engine:
vault secrets enable -path=kv kv-v2
- Create a secret:
vault kv put kv/myapp s3-bucket-name=wallace-prod-bucket
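You can verify the secret was written before wiring it into Terraform:
vault kv get kv/myapp   # shows s3-bucket-name under the data section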
- Create a policy and role:
vault policy write terraform - <<EOF
path "kv/data/*" {
capabilities = ["create", "read", "update", "delete", "list"]
}
EOF
# The AppRole auth method must be enabled before creating the role
vault auth enable approle
vault write auth/approle/role/terraform \
secret_id_ttl=10m \
token_num_uses=10 \
token_ttl=20m \
token_max_ttl=30m \
secret_id_num_uses=40 \
token_policies=terraform
- Get the role_id and secret_id:
vault read auth/approle/role/terraform/role-id
vault write -f auth/approle/role/terraform/secret-id
The following example creates an S3 bucket whose name is securely retrieved from the secret key s3-bucket-name.
provider "aws" {
region = "us-west-1"
}
provider "vault" {
address = "http://x.x.x.x:8200"
skip_child_token = true
auth_login {
path = "auth/approle/login" # Use the AppRole auth method
parameters = {
role_id = "42fb0f72-2c2a-abb9-7b8e-b2f73ac75e83"
secret_id = "ba45c19c-206e-c201-f454-83f772ba5f40"
}
}
}
data "vault_kv_secret_v2" "example" {
mount = "kv" # Change it according to your mount
name = "test-secret" #name of the secret in Vault
}
resource "aws_s3_bucket" "example" {
bucket = data.vault_kv_secret_v2.example.data["s3-bucket-name"]}
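Hardcoding role_id and secret_id in the configuration defeats part of the purpose. A sketch of passing them as sensitive input variables instead, supplied at run time (e.g., via TF_VAR_vault_role_id and TF_VAR_vault_secret_id):
variable "vault_role_id" {
  type      = string
  sensitive = true # keep the value out of plan output
}
variable "vault_secret_id" {
  type      = string
  sensitive = true
}
provider "vault" {
  address = "http://x.x.x.x:8200"
  auth_login {
    path = "auth/approle/login"
    parameters = {
      role_id   = var.vault_role_id
      secret_id = var.vault_secret_id
    }
  }
}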
Final Thoughts
Terraform is a must-have tool in any SRE or DevOps engineer’s toolkit. Whether you’re managing complex multi-cloud infrastructure, isolating environments with workspaces, or automating test environments with provisioners, Terraform brings structure, safety, and scalability to your infrastructure operations.
You can find all my Terraform sample scripts here. Thank you for reading, and I hope you found it useful.