Master Amazon EKS Metrics: Automated Collection with AWS Prometheus

Observability at scale is the silent killer of Kubernetes operations. For expert platform engineers, the challenge isn’t just generating Amazon EKS metrics; it is ingesting, storing, and querying them without managing a fragile, self-hosted Prometheus stateful set that collapses under high cardinality.

In this guide, we bypass the basics. We will architect a production-grade observability pipeline using Amazon Managed Service for Prometheus (AMP) and the AWS Distro for OpenTelemetry (ADOT). We will cover Infrastructure as Code (Terraform) implementation, IAM Roles for Service Accounts (IRSA) security patterns, and advanced filtering techniques to keep your metric ingestion costs manageable.

The Scaling Problem: Why Self-Hosted Prometheus Fails EKS

Standard Prometheus deployments on EKS work flawlessly for development clusters. However, as you scale to hundreds of nodes and thousands of pods, the “pull-based” model combined with local TSDB storage hits a ceiling.

  • Vertical Scaling Limits: A single Prometheus server eventually runs out of memory (OOM) attempting to ingest millions of active series.
  • Data Persistence: Managing EBS volumes for long-term metric retention is operational toil.
  • High Availability: Running HA Prometheus pairs doubles your cost and introduces data-gap and deduplication complexities during failovers.

Pro-Tip: The solution is to decouple collection from storage. By using stateless collectors (ADOT) to scrape Amazon EKS metrics and remote-writing them to a managed backend (AMP), you offload the heavy lifting of storage, availability, and backups to AWS.

Architecture: EKS, ADOT, and AMP

The modern AWS-native observability stack consists of three distinct layers:

  1. Generation: Your application pods and Kubernetes node exporters.
  2. Collection (The Agent): The AWS Distro for OpenTelemetry (ADOT) collector running as a DaemonSet or Deployment. It scrapes Prometheus endpoints and remote-writes data.
  3. Storage (The Backend): Amazon Managed Service for Prometheus (AMP), which is Cortex-based, scalable, and fully compatible with PromQL.

Step-by-Step Implementation

We will use Terraform for the infrastructure foundation and Helm for the Kubernetes components.

1. Provisioning the AMP Workspace

First, we create the AMP workspace. This is the distinct logical space where your metrics will reside.

resource "aws_prometheus_workspace" "eks_observability" {
  alias = "production-eks-metrics"

  tags = {
    Environment = "Production"
    ManagedBy   = "Terraform"
  }
}

output "amp_workspace_id" {
  value = aws_prometheus_workspace.eks_observability.id
}

output "amp_remote_write_url" {
  value = "${aws_prometheus_workspace.eks_observability.prometheus_endpoint}api/v1/remote_write"
}

2. Security: IRSA for Metric Ingestion

The ADOT collector needs permission to write to AMP. We utilize IAM Roles for Service Accounts (IRSA) to grant least-privilege access, avoiding static access keys.

Attach the AWS managed policy AmazonPrometheusRemoteWriteAccess, or a scoped customer-managed policy like the one below, to a role trusted by your EKS cluster's OIDC provider. Note that aps:RemoteWrite is the only action the collector strictly needs for ingestion; the query actions shown are optional conveniences.

data "aws_iam_policy_document" "amp_ingest_policy" {
  statement {
    actions = [
      "aps:RemoteWrite",
      "aps:GetSeries",
      "aps:GetLabels",
      "aps:GetMetricMetadata"
    ]
    resources = [aws_prometheus_workspace.eks_observability.arn]
  }
}

resource "aws_iam_role" "adot_collector" {
  name = "eks-adot-collector-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRoleWithWebIdentity"
      Effect = "Allow"
      Principal = {
        Federated = "arn:aws:iam::${var.account_id}:oidc-provider/${var.oidc_provider}"
      }
      Condition = {
        StringEquals = {
          "${var.oidc_provider}:sub" = "system:serviceaccount:adot-system:adot-collector"
        }
      }
    }]
  })
}
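
To complete the IRSA wiring, create the policy from the policy document and attach it to the role. A minimal sketch follows; the new resource names (amp_ingest, adot_collector_amp_ingest) are arbitrary placeholders, and only the references back to the earlier snippets matter.

resource "aws_iam_policy" "amp_ingest" {
  name   = "eks-amp-ingest-policy"
  policy = data.aws_iam_policy_document.amp_ingest_policy.json
}

resource "aws_iam_role_policy_attachment" "adot_collector_amp_ingest" {
  # Binds the ingest permissions to the role assumed by the ADOT service account
  role       = aws_iam_role.adot_collector.name
  policy_arn = aws_iam_policy.amp_ingest.arn
}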

3. Deploying the ADOT Collector

We deploy the ADOT collector using the EKS add-on or Helm. For granular control over the scraping configuration, the Helm chart is often preferred by power users.

Below is a snippet of the values.yaml configuration required to enable the Prometheus receiver and configure the remote write exporter to send Amazon EKS metrics to your workspace.

# ADOT Helm values.yaml
mode: deployment
serviceAccount:
  create: true
  name: adot-collector
  annotations:
    eks.amazonaws.com/role-arn: "arn:aws:iam::123456789012:role/eks-adot-collector-role"

config:
  receivers:
    prometheus:
      config:
        global:
          scrape_interval: 15s
        scrape_configs:
          - job_name: 'kubernetes-pods'
            kubernetes_sd_configs:
              - role: pod
            relabel_configs:
              - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
                action: keep
                regex: true

  exporters:
    prometheusremotewrite:
      endpoint: "https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-xxxx/api/v1/remote_write"
      auth:
        authenticator: sigv4auth

  extensions:
    sigv4auth:
      region: "us-east-1"
      service: "aps"

  service:
    extensions: [sigv4auth]
    pipelines:
      metrics:
        receivers: [prometheus]
        exporters: [prometheusremotewrite]
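
If you prefer to keep the deployment in Terraform as well, a helm_release can consume the same values file. The sketch below assumes the Terraform Helm provider is configured against your cluster and uses the upstream opentelemetry-collector chart, whose values layout matches the snippet above; the values.yaml path is a placeholder.

resource "helm_release" "adot_collector" {
  name             = "adot-collector"
  repository       = "https://open-telemetry.github.io/opentelemetry-helm-charts"
  chart            = "opentelemetry-collector"
  namespace        = "adot-system"   # must match the IRSA service account subject
  create_namespace = true

  # Reuse the values.yaml shown above; the path is a placeholder relative to this module
  values = [file("${path.module}/values.yaml")]
}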

Optimizing Costs: Managing High Cardinality

Amazon EKS metrics can generate massive bills if you ingest every label from every ephemeral pod. AMP charges for samples ingested, metric storage, and query samples processed.

Filtering at the Collector Level

Use the processors block in your ADOT configuration to drop unnecessary metrics or labels before they leave the cluster.

processors:
  filter:
    metrics:
      exclude:
        match_type: strict
        metric_names:
          - kubelet_volume_stats_available_bytes
          - kubelet_volume_stats_capacity_bytes
          - container_fs_usage_bytes # Often high noise, low value
  resource:
    attributes:
      - key: jenkins_build_id
        action: delete  # Remove high-cardinality labels

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [filter, resource]  # Processors only take effect when listed in the pipeline
      exporters: [prometheusremotewrite]

Advanced Concept: Avoid including high-cardinality labels such as client_ip, user_id, or unique request_id in your metric dimensions. These explode the series count and degrade query performance in PromQL.

Visualizing with Amazon Managed Grafana

Once data is flowing into AMP, visualization is standard.

  1. Deploy Amazon Managed Grafana (AMG).
  2. Add the “Prometheus” data source.
  3. Enable “SigV4 auth” in the data source settings (this uses the AMG workspace IAM role to sign queries to AMP).
  4. Select your AMP region and workspace.

Because AMP is PromQL-compatible, you can import standard community dashboards (such as the Kubernetes cluster monitoring dashboards) and they will typically work with little or no modification.
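
If you want the Grafana workspace itself under Terraform, a minimal sketch looks like this. The authentication provider depends on how your organization signs in (IAM Identity Center shown here), and aws_iam_role.grafana is a hypothetical role that must trust grafana.amazonaws.com.

resource "aws_grafana_workspace" "observability" {
  name                     = "eks-observability"
  account_access_type      = "CURRENT_ACCOUNT"
  authentication_providers = ["AWS_SSO"]
  permission_type          = "SERVICE_MANAGED"
  data_sources             = ["PROMETHEUS"]
  role_arn                 = aws_iam_role.grafana.arn  # hypothetical role trusted by grafana.amazonaws.com
}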

Frequently Asked Questions (FAQ)

Does AMP support Prometheus Alert Manager?

Yes. AMP supports a serverless Alert Manager. You upload your alerting rules (YAML) and routing configuration directly to the AMP workspace via the AWS CLI or Terraform. You do not need to run a separate Alert Manager pod in your cluster.
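
For reference, a minimal Terraform sketch of both pieces might look like this; the file paths are placeholders, the rule file uses standard Prometheus rule syntax, and the definition file wraps a regular Alertmanager configuration in the format AMP expects.

resource "aws_prometheus_rule_group_namespace" "alerts" {
  name         = "eks-alert-rules"
  workspace_id = aws_prometheus_workspace.eks_observability.id
  data         = file("${path.module}/alerting-rules.yaml")   # placeholder path
}

resource "aws_prometheus_alert_manager_definition" "routing" {
  workspace_id = aws_prometheus_workspace.eks_observability.id
  definition   = file("${path.module}/alertmanager.yaml")     # placeholder path
}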

What is the difference between ADOT and the standard Prometheus Server?

The standard Prometheus server is a monolithic binary that scrapes, stores, and serves data. ADOT (based on the OpenTelemetry Collector) is a pipeline that receives data, processes it, and exports it. ADOT is stateless and easier to scale horizontally, making it ideal for shipping Amazon EKS metrics to a managed backend.

How do I monitor the control plane (API Server, etcd)?

The EKS control plane runs in an AWS-managed account, so you cannot scrape etcd directly. The Kubernetes API server, however, does expose a /metrics endpoint through the cluster endpoint, and your collector can scrape it like any other Prometheus target as long as its service account has the required RBAC (see the example below). Note that EKS “control plane logging” sends logs, not metrics, to CloudWatch, so it complements rather than replaces metric scraping.
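
As a sketch, the standard API server scrape job below can be appended to the scrape_configs shown earlier, assuming the collector's service account is bound to a ClusterRole that permits reading endpoints and the /metrics non-resource URL.

- job_name: 'kubernetes-apiservers'
  scheme: https
  kubernetes_sd_configs:
    - role: endpoints
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
    # Keep only the default/kubernetes:https endpoint, which is the API server itself
    - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
      action: keep
      regex: default;kubernetes;https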

Conclusion

Migrating to Amazon Managed Service for Prometheus allows expert teams to treat observability as a service rather than a server. By leveraging ADOT for collection and IRSA for security, you build a robust, scalable pipeline for your Amazon EKS metrics.

Your next step is to audit your current metric cardinality using the ADOT processor configuration to ensure you aren’t paying for noise. Focus on the golden signals—Latency, Traffic, Errors, and Saturation—and let AWS manage the infrastructure. Thank you for reading the DevopsRoles page!
