{"id":1229,"date":"2025-09-20T19:01:29","date_gmt":"2025-09-21T02:01:29","guid":{"rendered":"http:\/\/184.72.63.26\/?p=1229"},"modified":"2025-09-22T11:12:27","modified_gmt":"2025-09-22T18:12:27","slug":"monitoring-kubernetes-with-datadog-minikube-otel-demo","status":"publish","type":"post","link":"https:\/\/www.wallacel.com\/index.php\/2025\/09\/20\/monitoring-kubernetes-with-datadog-minikube-otel-demo\/","title":{"rendered":"Monitoring Kubernetes Cluster with Datadog (Minikube + OpenTelemetry Demo)"},"content":{"rendered":"\n<p>This hands-on guide shows how to <strong>monitor a Kubernetes cluster with Datadog<\/strong> using a local <strong>Minikube<\/strong> cluster and the <strong>OpenTelemetry Demo<\/strong> as a realistic microservices showcase. You\u2019ll route telemetry to Datadog via the <strong>Datadog<\/strong> <strong>Agent\u2019s OTLP endpoint<\/strong>, then use Kubernetes Overview\/Explorer, APM, and Error Tracking to troubleshoot common workload failures. By the end, you\u2019ll have deployed the stack, seen metrics\/logs\/traces in one place, and turned a fix into an actionable monitor.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Monitoring Kubernetes is Tricky<\/h2>\n\n\n\n<p>Kubernetes is dynamic and noisy: a single mis\u2011typed resource or env var can strand replicas in <strong>Pending<\/strong> or flip them into <strong>CrashLoopBackOff<\/strong>. 
Host\u2011centric monitoring misses this picture, so you need cross\u2011signal visibility tied to Kubernetes context.<\/p>\n\n\n\n<p>What you actually need:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Cluster &amp; workload state<\/strong> &#8211; Deployments\/ReplicaSets\/Pods, conditions, and <strong>Events<\/strong> to explain <em>why<\/em> something isn\u2019t scheduling<\/li>\n\n\n\n<li><strong>Golden signals<\/strong> &#8211; latency, errors, traffic, and saturation<\/li>\n\n\n\n<li><strong>End\u2011to\u2011end traces<\/strong> &#8211; tie user symptoms to service bottlenecks<\/li>\n\n\n\n<li><strong>Fast pivots<\/strong> between infra \u2192 workload \u2192 services \u2192 code \u2192 logs<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Why Datadog for Monitoring Kubernetes<\/h2>\n\n\n\n<p>Datadog gives you a <strong>Kubernetes-native, cross-signal view<\/strong> &#8211; metrics, logs, traces, and events all in one place, with cluster context such as nodes, namespaces, deployments, and pods baked in. 
You can pivot from <strong>infrastructure \u2192 workload \u2192 service \u2192 code<\/strong> in a couple of clicks, correlate symptoms with changes, and drive down Mean Time To Repair (<strong>MTTR<\/strong>) without bouncing between tools.<\/p>\n\n\n\n<p><strong>Kubernetes Overview &amp; Explorer<\/strong> \u2014 Live inventory of clusters, nodes, namespaces, Deployments, and Pods with status, Events, and quick pivots to YAML, logs, and traces.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"653\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2025\/09\/k8s-overview-1024x653.png\" alt=\"\" class=\"wp-image-1239\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/k8s-overview-1024x653.png 1024w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/k8s-overview-300x191.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/k8s-overview-768x489.png 768w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/k8s-overview.png 1177w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><strong>APM + Error Tracking<\/strong> \u2014 Group exceptions by signature, tie spikes to releases and services, and jump straight into traces and span\u2011based metrics to see where time is spent.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1020\" height=\"625\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2025\/09\/error-tracking.png\" alt=\"\" class=\"wp-image-1240\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/error-tracking.png 1020w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/error-tracking-300x184.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/error-tracking-768x471.png 768w\" sizes=\"auto, (max-width: 1020px) 100vw, 1020px\" \/><\/figure>\n\n\n\n<p><strong>OpenTelemetry\u2011first<\/strong> \u2014 Send 
<strong>OTLP<\/strong> directly to the <strong>Datadog Agent<\/strong>; keep vendor\u2011neutral instrumentation and your existing OTel SDKs.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"714\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2025\/09\/agent-1024x714.png\" alt=\"\" class=\"wp-image-1241\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/agent-1024x714.png 1024w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/agent-300x209.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/agent-768x535.png 768w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/agent.png 1181w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><strong>Out\u2011of\u2011the\u2011box Kubernetes signals<\/strong> \u2014 KSM Core + Orchestrator Explorer power ready\u2011made dashboards and <strong>monitor templates<\/strong> for crash loops, pending pods, throttling, node pressure, and more.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"671\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2025\/09\/k8s-dashboard-1024x671.png\" alt=\"\" class=\"wp-image-1242\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/k8s-dashboard-1024x671.png 1024w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/k8s-dashboard-300x197.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/k8s-dashboard-768x504.png 768w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/k8s-dashboard.png 1269w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">What is OpenTelemetry<\/h2>\n\n\n\n<p>OpenTelemetry (OTel) is the open standard for app telemetry (traces, metrics, logs) with portable SDKs and the <strong>OTLP<\/strong> wire protocol. 
Instrument once, keep your options open, and forward wherever you need.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Two common wiring patterns to Datadog<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Approach&nbsp;#1 \u2014 OTel Collector \u2192 Datadog<\/strong><br>The OTel Demo ships with an embedded <strong>Collector<\/strong>. Point it at Datadog using the <strong>Datadog exporter<\/strong> <strong>+ extension<\/strong>. <br>Pros: powerful pipelines\/processors, easy fan\u2011out to multiple backends.<br><\/li>\n\n\n\n<li><strong>Approach&nbsp;#2 \u2014 Datadog Agent (OTLP ingest)<\/strong><br>Deploy the <strong>Datadog Operator<\/strong> and have services send OTLP <strong>directly to the Datadog Agent<\/strong> on ports 4317 (gRPC) and 4318 (HTTP). <br>Pros: fewer moving parts, instant Kubernetes inventory via Kube State Metrics Core + Orchestrator Explorer, one in-cluster Service to target.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Why I pick the Datadog Agent approach (#2)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Production\u2011like<\/strong>: most teams already run the Datadog Agent, so pointing OTLP at it is a small, low-risk change.<\/li>\n\n\n\n<li><strong>OTel\u2011first<\/strong>: you instrument your code with OpenTelemetry SDKs, keeping vendor-neutral instrumentation while still leveraging Datadog&#8217;s platform features.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Prerequisites<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>macOS\/Linux shell<\/li>\n\n\n\n<li><strong>Docker Desktop<\/strong>, <strong>minikube<\/strong>, <strong>kubectl<\/strong>, <strong>Helm<\/strong><\/li>\n\n\n\n<li>Datadog account + <strong><a href=\"https:\/\/app.datadoghq.com\/organization-settings\/api-keys\" data-type=\"link\" data-id=\"https:\/\/app.datadoghq.com\/organization-settings\/api-keys\">Datadog API key<\/a><\/strong><\/li>\n<\/ul>\n\n\n\n<p>I am running minikube with the Docker driver, with 8 CPU cores and 8 GB of RAM allocated to 
the cluster:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"bash\" class=\"language-bash\">minikube start --driver=docker --memory=8192 --cpus=8<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Install the Datadog Operator &amp; Agent (OTLP ingest)<\/h2>\n\n\n\n<p>I\u2019ll deploy the <strong>Datadog<\/strong> <strong>Operator<\/strong> and a <strong>DatadogAgent<\/strong> CR that exposes the Agent\u2019s OTLP receivers (4317\/4318). Then we\u2019ll point the OTel Demo services directly at the Agent.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1) Create a secret to store Datadog API key and site<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"bash\" class=\"language-bash\">kubectl create ns datadog-operator\nkubectl -n datadog-operator create secret generic datadog-secret \\\n--from-literal=\"DD_API_KEY=&lt;YOUR_DD_API_KEY&gt;\" \\\n--from-literal=\"DD_SITE_PARAMETER=datadoghq.com\"<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">2) Install the Datadog Operator<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"bash\" class=\"language-bash\">helm repo add datadog https:\/\/helm.datadoghq.com\nhelm repo update\nhelm upgrade --install datadog-operator datadog\/datadog-operator -n datadog-operator<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">3) Configure Datadog Agent via custom resource<\/h3>\n\n\n\n<p>Create a file named <code>datadog-agent-minikube.yaml<\/code> with the following content to configure Datadog Agent:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Global config<\/strong>: sets the cluster name (<code>minikube<\/code>) and site (<code>datadoghq.com<\/code>), and pulls the API key from the <code>datadog-secret<\/code> (<code>DD_API_KEY<\/code>).<\/li>\n\n\n\n<li><code>otelCollector.enabled: true<\/code> \u2014 turns on the Agent\u2019s <strong>OTLP<\/strong> receivers (4317\/4318) so our OTel Demo app can send traces\/metrics\/logs directly to the Agent.<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-code\"><code 
lang=\"yaml\" class=\"language-yaml\">kind: \"DatadogAgent\"\napiVersion: \"datadoghq.com\/v2alpha1\"\nmetadata:\n  name: \"datadog\"\n  namespace: datadog-operator\nspec:\n  global:\n    site: \"datadoghq.com\"\n    credentials:\n      apiSecret:\n        secretName: \"datadog-secret\"\n        keyName: \"DD_API_KEY\"\n    tags:\n      - \"env:minikube\"\n    kubelet:\n      tlsVerify: false   \n    clusterName: \"minikube\"   \n  features:\n    logCollection:\n      enabled: true\n      containerCollectAll: true\n    otelCollector:\n      enabled: true\n    kubeStateMetricsCore:\n      enabled: true\n    orchestratorExplorer:\n      enabled: true\n<\/code><\/pre>\n\n\n\n<p>After Datadog Operator spins up the Agent, our OTel Demo services can point to it to stream telemetry to Datadog. Apply the change:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"bash\" class=\"language-bash\">kubectl apply -f datadog-agent-minikube.yaml\n\nkubectl -n datadog-operator get svc datadog-agent -o wide\n# Expect ports 4317 (grpc) and 4318 (http)<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Configure &amp; Deploy the OTel Demo<\/h2>\n\n\n\n<p>We\u2019ll stand up the <strong>OpenTelemetry Demo<\/strong> on Kubernetes so you have a realistic microservices playground. 
We\u2019ll modify the deployment with a custom values file that wires telemetry to Datadog via the <strong>Datadog Agent OTLP ingest<\/strong> endpoint.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1) Create a namespace and add the Helm repo<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"bash\" class=\"language-bash\">kubectl create ns otel-dd\nhelm repo add open-telemetry https:\/\/open-telemetry.github.io\/opentelemetry-helm-charts\nhelm repo update<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">2) Configure the OTel demo (lightweight footprint)<\/h3>\n\n\n\n<p>This patches <strong>app<\/strong> services only (not infra like Kafka\/Redis\/loadgen) and points them to the Datadog Agent\u2019s in\u2011cluster OTLP gRPC endpoint. It also disables the built\u2011ins (Grafana, Jaeger, Prometheus, and OpenSearch) that we won\u2019t use in this Datadog approach.<\/p>\n\n\n\n<p>Create a file named&nbsp;<code>values-otlp-dd-subset.yaml<\/code>&nbsp;with the following content:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"yaml\" class=\"language-yaml\"># Keep the demo small\njaeger: { enabled: false }\nprometheus: { enabled: false }\ngrafana: { enabled: false }\nopensearch: { enabled: false }\n\n_globals:\n  otlpEnv: &amp;otlpEnv\n    - name: OTEL_EXPORTER_OTLP_ENDPOINT\n      value: \"http:\/\/datadog-agent.datadog-operator.svc.cluster.local:4317\"\n    - name: OTEL_EXPORTER_OTLP_PROTOCOL\n      value: \"grpc\"\n    - name: OTEL_TRACES_EXPORTER\n      value: \"otlp\"\n    - name: OTEL_METRICS_EXPORTER\n      value: \"otlp\"\n    - name: OTEL_LOGS_EXPORTER\n      value: \"otlp\"\n\nadservice: { envOverrides: *otlpEnv }\ncartservice: { envOverrides: *otlpEnv }\ncheckoutservice: { envOverrides: *otlpEnv }\ncurrencyservice: { envOverrides: *otlpEnv }\nemailservice: { envOverrides: *otlpEnv }\nfrontend: { envOverrides: *otlpEnv }\nfrontendproxy: { envOverrides: *otlpEnv }\nimageprovider: { envOverrides: *otlpEnv }\npaymentservice: { envOverrides: *otlpEnv 
}\nproductcatalogservice: { envOverrides: *otlpEnv }\nquoteservice: { envOverrides: *otlpEnv }\nrecommendationservice: { envOverrides: *otlpEnv }\nshippingservice: { envOverrides: *otlpEnv }\n\n# keep loadgen off by default\nloadgenerator:\n  replicaCount: 0<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">3) Install the OTel demo with the values file<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"bash\" class=\"language-bash\">helm upgrade --install shop-dd open-telemetry\/opentelemetry-demo \\\n-n otel-dd -f values-otlp-dd-subset.yaml<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">4) Access the OTel Astronomy Shop UI &amp; tools<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"bash\" class=\"language-bash\">kubectl -n otel-dd port-forward svc\/frontend-proxy 8080:8080<\/code><\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Astronomy Shop UI: <a>http:\/\/localhost:8080\/<\/a><\/li>\n\n\n\n<li>Feature Flags UI: <a>http:\/\/localhost:8080\/feature\/<\/a><\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1004\" height=\"750\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2025\/09\/otel-demo.png\" alt=\"\" class=\"wp-image-1236\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/otel-demo.png 1004w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/otel-demo-300x224.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/otel-demo-768x574.png 768w\" sizes=\"auto, (max-width: 1004px) 100vw, 1004px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"768\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2025\/09\/otel-feature-1024x768.png\" alt=\"\" class=\"wp-image-1237\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/otel-feature-1024x768.png 1024w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/otel-feature-300x225.png 300w, 
https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/otel-feature-768x576.png 768w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/otel-feature.png 1046w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><strong>Tip<\/strong>: pause the load generator to keep a small footprint for our local setup; we can browse the webstore manually to generate traffic.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"bash\" class=\"language-bash\">kubectl -n otel-dd scale deploy load-generator --replicas=0<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Explore OpenTelemetry Data in Datadog<\/h3>\n\n\n\n<p>Open <strong>APM &gt; Software Catalog<\/strong> to view all running services of the OTel demo.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"658\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2025\/09\/service-1024x658.png\" alt=\"\" class=\"wp-image-1244\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/service-1024x658.png 1024w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/service-300x193.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/service-768x493.png 768w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/service.png 1147w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Switch to the <strong>Map<\/strong> view and change the Map layout to <strong>Flow<\/strong> to see the service topology and dependencies.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"757\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2025\/09\/map-flow-1-1024x757.png\" alt=\"\" class=\"wp-image-1246\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/map-flow-1-1024x757.png 1024w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/map-flow-1-300x222.png 300w, 
https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/map-flow-1-768x568.png 768w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/map-flow-1.png 1134w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Select a service (e.g. <strong>recommendation<\/strong>) to view its performance metrics (such as request throughput, latency, and errors) in a side panel without leaving the page.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"578\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2025\/09\/service-metrics-1024x578.png\" alt=\"\" class=\"wp-image-1249\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/service-metrics-1024x578.png 1024w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/service-metrics-300x169.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/service-metrics-768x434.png 768w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/service-metrics.png 1450w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Open <strong>APM &gt; Traces<\/strong> and filter by the <strong>frontend<\/strong> service (the Astronomy Shop UI).<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"627\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2025\/09\/trace-1024x627.png\" alt=\"\" class=\"wp-image-1251\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/trace-1024x627.png 1024w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/trace-300x184.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/trace-768x471.png 768w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/trace.png 1425w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Open a trace and select a span of interest to view the <strong>end-to-end trace details<\/strong>. 
The waterfall timeline shows each hop the request <strong>traverses<\/strong> and how much time is spent on each, which makes it ideal for troubleshooting performance issues (e.g., slow services, chatty calls, downstream errors).<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"535\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2025\/09\/span-1024x535.png\" alt=\"\" class=\"wp-image-1252\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/span-1024x535.png 1024w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/span-300x157.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/span-768x401.png 768w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/span.png 1442w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Troubleshooting Workloads<\/h3>\n\n\n\n<p>Datadog provides lots of tools to help you monitor and troubleshoot workloads running in Kubernetes clusters. The <strong>Kubernetes Overview<\/strong> provides a high-level view of your cluster\u2019s health and performance. 
It\u2019s a great starting point for troubleshooting, as it shows the summary status of your cluster\u2019s nodes and pods.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"692\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2025\/09\/kubernetes-overview-1024x692.png\" alt=\"\" class=\"wp-image-1256\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/kubernetes-overview-1024x692.png 1024w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/kubernetes-overview-300x203.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/kubernetes-overview-768x519.png 768w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/kubernetes-overview.png 1179w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Look at the <strong>Troubleshooting Patterns<\/strong> section and hover over the <strong>Pods in symptomatic phases<\/strong> graph: one pod is stuck in the <strong>Pending<\/strong> state.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"611\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2025\/09\/troubleshoot-patterns-3-1024x611.png\" alt=\"\" class=\"wp-image-1266\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/troubleshoot-patterns-3-1024x611.png 1024w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/troubleshoot-patterns-3-300x179.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/troubleshoot-patterns-3-768x458.png 768w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/troubleshoot-patterns-3.png 1094w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Navigate to <strong>Kubernetes Explorer<\/strong>, click <strong>Pods<\/strong>, and look at the <strong>Pending<\/strong> Pod Group. You can tell that there&#8217;s an issue with the <strong>ad<\/strong> service. 
Hover over the PENDING status to see why the pod is stuck &#8211; <strong>Insufficient memory<\/strong>.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"858\" height=\"362\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2025\/09\/pod-group-pending.png\" alt=\"\" class=\"wp-image-1268\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/pod-group-pending.png 858w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/pod-group-pending-300x127.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/pod-group-pending-768x324.png 768w\" sizes=\"auto, (max-width: 858px) 100vw, 858px\" \/><\/figure>\n\n\n\n<p>Click <strong>Investigate<\/strong> to open the pod&#8217;s <strong>Troubleshooter<\/strong> tab. It gives you a more comprehensive explanation of the scheduling problem. In this case, it reports that our minikube cluster needs more memory to run the ad service.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"903\" height=\"479\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2025\/09\/troubleshooter.png\" alt=\"\" class=\"wp-image-1269\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/troubleshooter.png 903w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/troubleshooter-300x159.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/troubleshooter-768x407.png 768w\" sizes=\"auto, (max-width: 903px) 100vw, 903px\" \/><\/figure>\n\n\n\n<p>With the <strong>ad service<\/strong> down, the webstore\u2019s ad banner renders blank.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"638\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2025\/09\/ad-service-fail-1024x638.png\" alt=\"\" class=\"wp-image-1288\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/ad-service-fail-1024x638.png 
1024w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/ad-service-fail-300x187.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/ad-service-fail-768x478.png 768w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/ad-service-fail.png 1317w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Let&#8217;s open the <strong>Kubernetes Cluster Overview<\/strong> <strong>Dashboard<\/strong> to confirm whether the cluster has enough memory to run the workload.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"597\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2025\/09\/kubernetes-overview-dashboard-1024x597.png\" alt=\"\" class=\"wp-image-1270\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/kubernetes-overview-dashboard-1024x597.png 1024w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/kubernetes-overview-dashboard-300x175.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/kubernetes-overview-dashboard-768x448.png 768w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/kubernetes-overview-dashboard.png 1144w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Hover over the allocatable line in the <strong>Cluster Memory Capacity<\/strong> graph: the cluster has 25.2G of allocatable memory, but only 3.65G was being requested. 
So there should be enough capacity for the <strong>ad<\/strong> service.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"751\" height=\"171\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2025\/09\/allocatable-memory-1.png\" alt=\"\" class=\"wp-image-1272\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/allocatable-memory-1.png 751w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/allocatable-memory-1-300x68.png 300w\" sizes=\"auto, (max-width: 751px) 100vw, 751px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"752\" height=\"168\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2025\/09\/requests-memory.png\" alt=\"\" class=\"wp-image-1273\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/requests-memory.png 752w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/requests-memory-300x67.png 300w\" sizes=\"auto, (max-width: 752px) 100vw, 752px\" \/><\/figure>\n\n\n\n<p>Our cluster is healthy and has enough memory to run the ad service. So why is the node reported as having insufficient memory? Let&#8217;s look at the pod&#8217;s spec to check whether it&#8217;s requesting a reasonable amount of memory. 
Navigate to <strong>Kubernetes Explorer &gt; Pods<\/strong> and, in the facets panel on the left, click <strong>Pending<\/strong> under <strong>Status<\/strong> to show only pods in the Pending state.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"494\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2025\/09\/pending-pod-1-1024x494.png\" alt=\"\" class=\"wp-image-1274\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/pending-pod-1-1024x494.png 1024w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/pending-pod-1-300x145.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/pending-pod-1-768x371.png 768w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/pending-pod-1.png 1150w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Click on the <strong>pending<\/strong> pod (here, <strong>ad-594fddd6.<\/strong>) to see the details. Navigate to the <strong>YAML<\/strong> tab to inspect its metadata and spec. Scrolling down to the resources section, we can see that the pod is configured to request <strong>300Gi<\/strong> of memory, which far exceeds the allocatable memory in our cluster. This is clearly a typo.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"667\" height=\"249\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2025\/09\/yaml-spec-1.png\" alt=\"\" class=\"wp-image-1276\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/yaml-spec-1.png 667w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/yaml-spec-1-300x112.png 300w\" sizes=\"auto, (max-width: 667px) 100vw, 667px\" \/><\/figure>\n\n\n\n<p>To fix the issue, we can modify the manifest for the <strong>ad<\/strong> workload by changing both the memory request and the memory limit to the correct setting (i.e. 
<strong>300Mi<\/strong>):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"bash\" class=\"language-bash\">kubectl set resources deploy\/ad --requests=memory=300Mi --limits=memory=300Mi -n otel-dd<\/code><\/pre>\n\n\n\n<p>Return to the <strong>Kubernetes Cluster Overview<\/strong> <strong>Dashboard<\/strong>. Under the <strong>Deployment<\/strong> section, we can confirm the issue is fixed and all pods are up and running, because the number of <strong>Desired<\/strong> pods equals the number <strong>Available<\/strong> (i.e. <strong>23<\/strong>).<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"607\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2025\/09\/deployment-dashboard-1-1024x607.png\" alt=\"\" class=\"wp-image-1280\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/deployment-dashboard-1-1024x607.png 1024w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/deployment-dashboard-1-300x178.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/deployment-dashboard-1-768x455.png 768w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/deployment-dashboard-1.png 1153w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>The ad banner is back online and displaying again!<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"638\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2025\/09\/ad-service-1-1024x638.png\" alt=\"\" class=\"wp-image-1289\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/ad-service-1-1024x638.png 1024w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/ad-service-1-300x187.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/ad-service-1-768x478.png 768w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/ad-service-1.png 1317w\" sizes=\"auto, (max-width: 1024px) 100vw, 
1024px\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Create a Monitor to stay alert<\/h3>\n\n\n\n<p>Let&#8217;s create a monitor so that we get alerted if similar issues recur and can fix them more quickly. Datadog provides many built-in monitor templates. Navigate to the <strong>Kubernetes Overview &gt; Monitors Templates<\/strong> section:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"459\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2025\/09\/monitor-template-1024x459.png\" alt=\"\" class=\"wp-image-1290\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/monitor-template-1024x459.png 1024w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/monitor-template-300x135.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/monitor-template-768x345.png 768w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/monitor-template.png 1150w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Locate the <strong>Kubernetes Deployment Replicas are failing<\/strong> monitor and click <strong>Configure<\/strong>. 
This opens a pre-configured metric monitor that tracks pods in the <strong>Pending<\/strong> state.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"743\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2025\/09\/monitor-pending-1-1024x743.png\" alt=\"\" class=\"wp-image-1294\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/monitor-pending-1-1024x743.png 1024w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/monitor-pending-1-300x218.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/monitor-pending-1-768x557.png 768w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/monitor-pending-1.png 1060w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Under <strong>Define the metric<\/strong>, the metric query for this monitor is the difference between the number of deployment replicas <strong>desired<\/strong> and the number <strong>available<\/strong>. You can see from the graph that the value for <strong>kube_deployment:ad<\/strong> was at <strong>1<\/strong> while the pod was stuck in the <strong>Pending<\/strong> state, and it has now returned to <strong>0<\/strong>.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"682\" height=\"206\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2025\/09\/alert-graph.png\" alt=\"\" class=\"wp-image-1295\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/alert-graph.png 682w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/alert-graph-300x91.png 300w\" sizes=\"auto, (max-width: 682px) 100vw, 682px\" \/><\/figure>\n\n\n\n<p>This is a useful metric to track for any pod. 
We can also configure the alert to send notifications via email, text, or Slack to the right team members, so they can look at the issue as soon as the alert is triggered.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"753\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2025\/09\/alert-notification-1024x753.png\" alt=\"\" class=\"wp-image-1296\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/alert-notification-1024x753.png 1024w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/alert-notification-300x220.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/alert-notification-768x564.png 768w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/alert-notification.png 1060w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Navigate to the <strong>Monitors<\/strong> page and you can see your new monitor in the list:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"290\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2025\/09\/monitor-list-1024x290.png\" alt=\"\" class=\"wp-image-1297\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/monitor-list-1024x290.png 1024w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/monitor-list-300x85.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/monitor-list-768x218.png 768w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2025\/09\/monitor-list.png 1066w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Now you\u2019ve got guardrails in place. 
If a similar issue pops up, your team will hear about it fast.<\/p>\n\n\n\n<p>To summarize, in this troubleshooting scenario we:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Used&nbsp;<strong>Kubernetes Overview<\/strong>&nbsp;and&nbsp;<strong>Kubernetes Explorer<\/strong>&nbsp;to identify a pod stuck in&nbsp;<code><strong>Pending<\/strong><\/code>&nbsp;due to memory constraints.<\/li>\n\n\n\n<li>Verified overall capacity in the <strong>Kubernetes Cluster Overview<\/strong> dashboard.<\/li>\n\n\n\n<li>Inspected the Pod spec in&nbsp;<strong>Kubernetes Explorer<\/strong>&nbsp;and found a typo in the memory request.<\/li>\n\n\n\n<li>Fixed the manifest and redeployed the ad service, restoring the banner.<\/li>\n\n\n\n<li>Created a <strong>Monitor<\/strong> so any future unschedulable Pods trigger an alert.<\/li>\n<\/ol>\n\n\n\n<p>Thank you for reading, and I hope this walkthrough was useful.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This hands-on guide shows how to monitor a Kubernetes cluster with Datadog using a local Minikube and the OpenTelemetry Demo as a realistic microservices showcase. You\u2019ll route telemetry to Datadog via the Datadog Agent\u2019s OTLP endpoint, then use Kubernetes Overview\/Explorer, APM, and Error Tracking to troubleshoot common workload failures. 
By the end, you\u2019ll deploy fast, [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":1307,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[78],"tags":[79,23],"class_list":["post-1229","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-datadog","tag-datadog","tag-kubernetes"],"_links":{"self":[{"href":"https:\/\/www.wallacel.com\/index.php\/wp-json\/wp\/v2\/posts\/1229","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.wallacel.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.wallacel.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.wallacel.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.wallacel.com\/index.php\/wp-json\/wp\/v2\/comments?post=1229"}],"version-history":[{"count":28,"href":"https:\/\/www.wallacel.com\/index.php\/wp-json\/wp\/v2\/posts\/1229\/revisions"}],"predecessor-version":[{"id":1315,"href":"https:\/\/www.wallacel.com\/index.php\/wp-json\/wp\/v2\/posts\/1229\/revisions\/1315"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.wallacel.com\/index.php\/wp-json\/wp\/v2\/media\/1307"}],"wp:attachment":[{"href":"https:\/\/www.wallacel.com\/index.php\/wp-json\/wp\/v2\/media?parent=1229"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.wallacel.com\/index.php\/wp-json\/wp\/v2\/categories?post=1229"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.wallacel.com\/index.php\/wp-json\/wp\/v2\/tags?post=1229"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}