{"id":885,"date":"2024-09-30T18:49:55","date_gmt":"2024-10-01T01:49:55","guid":{"rendered":"http:\/\/184.72.63.26\/?p=885"},"modified":"2024-11-29T20:45:03","modified_gmt":"2024-11-30T03:45:03","slug":"implementing-aws-x-ray-for-tracing-your-application","status":"publish","type":"post","link":"https:\/\/www.wallacel.com\/index.php\/2024\/09\/30\/implementing-aws-x-ray-for-tracing-your-application\/","title":{"rendered":"Implementing AWS X-Ray for Tracing and Debugging Your Application"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Observability plays a crucial role in understanding, maintaining, and optimizing complex systems in the realm of <strong>Site Reliability Engineering (SRE)<\/strong>. It enables SREs to gain deep insights into the internal workings of systems by analyzing outputs such as metrics, logs, and traces.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">When it comes to distributed applications, it&#8217;s essential to ensure that you can observe all the moving parts to quickly diagnose and resolve issues. Observability is built on three key pillars: <strong>metrics<\/strong>, <strong>logs<\/strong>, and <strong>traces<\/strong>. 
In my previous <a href=\"http:\/\/184.72.63.26\/index.php\/2024\/06\/26\/implement-rag-with-aws-bedrock-and-mongodb-atlas\/\" data-type=\"link\" data-id=\"http:\/\/184.72.63.26\/index.php\/2024\/06\/26\/implement-rag-with-aws-bedrock-and-mongodb-atlas\/\">post<\/a>, I created a RAG chatbot using an AWS Bedrock knowledge base. In this post, I will use that chatbot as an example to guide you through setting up tracing with <strong>AWS X-Ray<\/strong>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The Three Pillars of Observability<\/h3>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"677\" height=\"257\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2024\/09\/three-pillars-1.png\" alt=\"\" class=\"wp-image-938\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/three-pillars-1.png 677w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/three-pillars-1-300x114.png 300w\" sizes=\"auto, (max-width: 677px) 100vw, 677px\" \/><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">1. <strong>Metrics<\/strong><\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Metrics are numerical representations of system performance over time. These values provide insight into how a system is functioning and can be used to detect anomalies, forecast trends, and set thresholds for alerts. Metrics such as request latency, CPU usage, and error rates are typically aggregated over time.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">2. <strong>Logs<\/strong><\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Logs capture detailed records of events, providing information about what happened at specific points in time. They offer a deep level of insight into the specific actions taken by services, users, and infrastructure. Logs are indispensable for debugging issues and tracing system behaviour.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">3. 
<strong>Traces<\/strong><\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Traces capture the end-to-end journey of a request through various services in a distributed system. In modern architectures such as microservices, tracing is essential for understanding how requests interact with different components and for pinpointing bottlenecks or failures.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>AWS X-Ray<\/strong> enables distributed tracing by showing the lifecycle of requests across AWS services such as Lambda, API Gateway, SQS, OpenSearch, and more.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Tracing with AWS X-Ray: Why It&#8217;s Important<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">My chatbot involves multiple services like <strong>Lambda functions<\/strong>, <strong>API Gateway<\/strong>, <strong>SQS<\/strong>, <strong>OpenSearch<\/strong>, and <strong>Bedrock<\/strong>, and tracing helps monitor how queries from users traverse through these services. Tracing provides valuable insights into performance issues, error handling, and dependencies, allowing us to optimize the system and ensure reliability.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Key Benefits of Tracing:<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Latency Visualization<\/strong>: Identify slow components in a request lifecycle.<\/li>\n\n\n\n<li><strong>Dependency Mapping<\/strong>: Understand how different services interact with each other.<\/li>\n\n\n\n<li><strong>Error Diagnosis<\/strong>: Identify where and why requests are failing.<\/li>\n\n\n\n<li><strong>End-to-End Monitoring<\/strong>: Track a request from API Gateway to Lambda, SQS, OpenSearch, and other services.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Setting Up AWS X-Ray and AWS Distro for OpenTelemetry Collector for Tracing<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">In this section, I\u2019ll guide you through setting up tracing using <strong>AWS X-Ray<\/strong> and <strong>AWS Distro for OpenTelemetry 
(ADOT)<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>AWS Distro for OpenTelemetry (ADOT)<\/strong> is an AWS-supported distribution of the Cloud Native Computing Foundation (CNCF) OpenTelemetry project. OpenTelemetry (OTel) provides open source APIs, libraries, and agents to collect logs, metrics, and traces.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">With ADOT, you can instrument your applications once and send correlated logs, metrics, and traces to one or more observability backends such as Amazon Managed Service for Prometheus, Amazon CloudWatch, AWS X-Ray, Amazon OpenSearch Service, any OpenTelemetry Protocol (OTLP) compliant backend, as well as Amazon Managed Streaming for Apache Kafka (MSK).<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1000\" height=\"578\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2024\/09\/adot.png\" alt=\"\" class=\"wp-image-886\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/adot.png 1000w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/adot-300x173.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/adot-768x444.png 768w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><figcaption class=\"wp-element-caption\">Image source: https:\/\/aws-otel.github.io\/docs\/introduction<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Step 1: Install ADOT Collector Using ECS<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">To collect telemetry data from the AWS environment, I\u2019ll install the <strong>ADOT collector<\/strong> on <strong>ECS<\/strong> using an ECS task definition. The ADOT collector can gather telemetry such as metrics and traces from different AWS services and forward the traces to <strong>AWS X-Ray<\/strong>.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>a. 
Create new IAM Policy &#8211; AWSDistroOpenTelemetryPolicy<\/strong><\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Create a new IAM Policy named <strong>AWSDistroOpenTelemetryPolicy<\/strong> with the following policy document:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"json\" class=\"language-json\">{\n    \"Version\": \"2012-10-17\",\n    \"Statement\": [\n        {\n            \"Effect\": \"Allow\",\n            \"Action\": [\n                \"logs:PutLogEvents\",\n                \"logs:CreateLogGroup\",\n                \"logs:CreateLogStream\",\n                \"logs:DescribeLogStreams\",\n                \"logs:DescribeLogGroups\",\n                \"logs:PutRetentionPolicy\",\n                \"xray:PutTraceSegments\",\n                \"xray:PutTelemetryRecords\",\n                \"xray:GetSamplingRules\",\n                \"xray:GetSamplingTargets\",\n                \"xray:GetSamplingStatisticSummaries\",\n                \"cloudwatch:PutMetricData\",\n                \"ec2:DescribeVolumes\",\n                \"ec2:DescribeTags\",\n                \"ssm:GetParameters\"\n            ],\n            \"Resource\": \"*\"\n        }\n    ]\n}<\/code><\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">b. 
<strong>Create new IAM Role<\/strong> &#8211; <strong>AWSOTTaskRole<\/strong><\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Create a new IAM Role named <strong>AWSOTTaskRole<\/strong> for running the ADOT Collector in an ECS task, and attach the newly created policy <strong>AWSDistroOpenTelemetryPolicy<\/strong> to it.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"881\" height=\"820\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2024\/09\/image-72.png\" alt=\"\" class=\"wp-image-887\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-72.png 881w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-72-300x279.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-72-768x715.png 768w\" sizes=\"auto, (max-width: 881px) 100vw, 881px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"484\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2024\/09\/image-73-1024x484.png\" alt=\"\" class=\"wp-image-888\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-73-1024x484.png 1024w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-73-300x142.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-73-768x363.png 768w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-73.png 1061w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"784\" height=\"612\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2024\/09\/image-74.png\" alt=\"\" class=\"wp-image-889\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-74.png 784w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-74-300x234.png 300w, 
https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-74-768x600.png 768w\" sizes=\"auto, (max-width: 784px) 100vw, 784px\" \/><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">c. <strong>Create new IAM Role<\/strong> &#8211; <strong>TaskExecutionRole<\/strong><\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Create a new IAM Role named <strong>TaskExecutionRole<\/strong> to grant ECS permission to make AWS API calls, and attach these three policies:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>AmazonECSTaskExecutionRolePolicy<\/strong><\/li>\n\n\n\n<li><strong>CloudWatchLogsFullAccess<\/strong><\/li>\n\n\n\n<li><strong>AmazonSSMReadOnlyAccess<\/strong><\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"947\" height=\"831\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2024\/09\/image-75.png\" alt=\"\" class=\"wp-image-890\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-75.png 947w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-75-300x263.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-75-768x674.png 768w\" sizes=\"auto, (max-width: 947px) 100vw, 947px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"803\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2024\/09\/image-76-1024x803.png\" alt=\"\" class=\"wp-image-891\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-76-1024x803.png 1024w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-76-300x235.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-76-768x602.png 768w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-76.png 1048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">d. 
<strong>Create new ECS Cluster<\/strong><\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Open the <strong>Amazon ECS Console<\/strong>, choose <strong>Create Cluster<\/strong> and select <strong>EC2 Linux<\/strong> (for EC2-backed tasks) to create an ECS cluster.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1009\" height=\"839\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2024\/09\/image-77.png\" alt=\"\" class=\"wp-image-893\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-77.png 1009w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-77-300x249.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-77-768x639.png 768w\" sizes=\"auto, (max-width: 1009px) 100vw, 1009px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"733\" height=\"784\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2024\/09\/image-78.png\" alt=\"\" class=\"wp-image-894\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-78.png 733w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-78-280x300.png 280w\" sizes=\"auto, (max-width: 733px) 100vw, 733px\" \/><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">e. <strong>Create new ECS Task<\/strong> <strong>for the ADOT Collector<\/strong><\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">After the ECS cluster is created, go to Task definitions in the ECS console to create a new task definition. 
I will use a task definition template (ecs-ec2-sidecar.json) downloaded from <a href=\"https:\/\/github.com\/aws-observability\/aws-otel-collector\/blob\/main\/examples\/ecs\/aws-cloudwatch\/ecs-ec2-sidecar.json\" data-type=\"link\" data-id=\"https:\/\/github.com\/aws-observability\/aws-otel-collector\/blob\/main\/examples\/ecs\/aws-cloudwatch\/ecs-ec2-sidecar.json\">GitHub<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Update the following parameters in the task definition template:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>{{region}}<\/strong> &#8211; the region the data will be sent to (e.g. us-west-2)<\/li>\n\n\n\n<li><strong>{{ecsTaskRoleArn}}<\/strong> &#8211; the ARN of the <strong>AWSOTTaskRole<\/strong> created in the previous section<\/li>\n\n\n\n<li><strong>{{ecsExecutionRoleArn}}<\/strong> &#8211; the ARN of the <strong>TaskExecutionRole<\/strong> created in the previous section<\/li>\n\n\n\n<li><strong>command<\/strong> &#8211; set the command to select the config file path, e.g. <strong>--config=\/etc\/ecs\/ecs-default-config.yaml<\/strong><\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">The task requires <strong>1 vCPU<\/strong> and <strong>2GB<\/strong> memory, so make sure the EC2 instance type in your Auto Scaling group launch template meets this requirement. 
Here my ASG is using the <strong>t2.medium<\/strong> instance type with 2 vCPU and 4GB memory.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"733\" height=\"254\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2024\/09\/image-88.png\" alt=\"\" class=\"wp-image-908\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-88.png 733w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-88-300x104.png 300w\" sizes=\"auto, (max-width: 733px) 100vw, 733px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Once the task definition is created, deploy the task to your ECS cluster.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"960\" height=\"843\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2024\/09\/image-80.png\" alt=\"\" class=\"wp-image-897\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-80.png 960w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-80-300x263.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-80-768x674.png 768w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"964\" height=\"821\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2024\/09\/image-81.png\" alt=\"\" class=\"wp-image-898\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-81.png 964w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-81-300x255.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-81-768x654.png 768w\" sizes=\"auto, (max-width: 964px) 100vw, 964px\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Step 2: Set Up Tracing for Lambda Functions<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">After setting up the ADOT collector, let\u2019s move on to tracing your <strong>Lambda 
functions<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">ADOT supports auto-instrumentation for Lambda functions by packaging OpenTelemetry together with an out-of-the-box configuration for AWS Lambda and AWS X-Ray as a Lambda layer. We can therefore enable OpenTelemetry for our Lambda function without changing any code.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">a. <strong>Enable Active Tracing on Your Lambda Function<\/strong><\/h4>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Open the <strong>AWS Lambda Console<\/strong>.<\/li>\n\n\n\n<li>Select your <strong>Lambda function<\/strong>.<\/li>\n\n\n\n<li>Under <strong>Monitoring and Operations Tools<\/strong>, enable <strong>Active Tracing<\/strong>. This tells Lambda to send trace data to AWS X-Ray for the invocations it samples.<\/li>\n<\/ol>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"801\" height=\"491\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2024\/09\/image-84.png\" alt=\"\" class=\"wp-image-903\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-84.png 801w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-84-300x184.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-84-768x471.png 768w\" sizes=\"auto, (max-width: 801px) 100vw, 801px\" \/><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">b. 
<strong>Attach the ADOT Lambda Layer<\/strong><\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">To instrument the Lambda function with OpenTelemetry, we will attach a reduced version of the <strong>AWS Distro for OpenTelemetry (ADOT) Lambda layer<\/strong> to the function.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>In the <strong>AWS Lambda Console<\/strong>, go to your Lambda function.<\/li>\n\n\n\n<li>Under <strong>Layers<\/strong>, click <strong>Add a Layer<\/strong>.<\/li>\n\n\n\n<li>Choose <strong>Specify an ARN<\/strong> and enter the ADOT layer ARN that matches your region, runtime, and architecture. The one below is the Python amd64 layer in us-west-2:<\/li>\n\n\n\n<li><code><strong>arn:aws:lambda:us-west-2:901920570463:layer:aws-otel-python-amd64-ver-1-25-0:1<\/strong><\/code><\/li>\n\n\n\n<li>Deploy your Lambda function.<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"801\" height=\"651\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2024\/09\/image-85.png\" alt=\"\" class=\"wp-image-904\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-85.png 801w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-85-300x244.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-85-768x624.png 768w\" sizes=\"auto, (max-width: 801px) 100vw, 801px\" \/><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">c. 
<strong>Add Lambda Environment Variable<\/strong><\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Add the environment variable <strong>AWS_LAMBDA_EXEC_WRAPPER<\/strong> and set it to <strong>\/opt\/otel-instrument<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"791\" height=\"488\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2024\/09\/image-86.png\" alt=\"\" class=\"wp-image-905\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-86.png 791w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-86-300x185.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-86-768x474.png 768w\" sizes=\"auto, (max-width: 791px) 100vw, 791px\" \/><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">d. <strong>Add Lambda Execution Role Permission<\/strong><\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Lambda needs the following permissions to send trace data to X-Ray. Add them to your lambda function&#8217;s execution role:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"json\" class=\"language-json\">{\n\t\"Version\": \"2012-10-17\",\n\t\"Statement\": [\n\t\t{\n\t\t\t\"Sid\": \"VisualEditor0\",\n\t\t\t\"Effect\": \"Allow\",\n\t\t\t\"Action\": [\n\t\t\t\t\"xray:PutTelemetryRecords\",\n\t\t\t\t\"xray:PutTraceSegments\"\n\t\t\t],\n\t\t\t\"Resource\": \"*\"\n\t\t}\n\t]\n}<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"471\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2024\/09\/image-87-1024x471.png\" alt=\"\" class=\"wp-image-906\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-87-1024x471.png 1024w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-87-300x138.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-87-768x353.png 768w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-87.png 
1068w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Step 3: Set Up Tracing for API Gateway<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>API Gateway<\/strong> is often the entry point for requests to your Lambda functions. To trace requests passing through API Gateway, follow these steps:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Enable X-Ray on API Gateway<\/strong>:\n<ul class=\"wp-block-list\">\n<li>Open the <strong>API Gateway Console<\/strong>.<\/li>\n\n\n\n<li>Select your API.<\/li>\n\n\n\n<li>In the <strong>Stages<\/strong> section, edit <strong>Logs and tracing<\/strong> and check the box to enable <strong>X-Ray Tracing<\/strong>.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Deploy API Gateway<\/strong>:\n<ul class=\"wp-block-list\">\n<li>After enabling X-Ray, deploy your API changes.<\/li>\n\n\n\n<li>This will send traces for incoming requests from API Gateway to Lambda, allowing you to trace the complete request lifecycle.<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"752\" height=\"683\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2024\/09\/image-89.png\" alt=\"\" class=\"wp-image-910\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-89.png 752w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/image-89-300x272.png 300w\" sizes=\"auto, (max-width: 752px) 100vw, 752px\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Step 4: SQS and X-Ray Tracing<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Amazon SQS integrates with AWS X-Ray to propagate tracing headers, allowing trace continuity from the sender to the consumer. When a message is sent to SQS using the AWS SDK, the X-Amzn-Trace-Id is automatically attached. 
This enables trace context propagation, which can be carried through to the consumer Lambda function, allowing you to track the message&#8217;s journey through the system.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To illustrate this auto-propagation, I modified my chatbot Lambda function to send a message to an SQS queue every time there is an enquiry. I then created another Lambda function to consume the message, so we can see how the two traces (from the producer Lambda to the consumer Lambda) are linked together.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"291\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2024\/09\/sqs-1024x291.png\" alt=\"\" class=\"wp-image-911\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/sqs-1024x291.png 1024w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/sqs-300x85.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/sqs-768x218.png 768w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/sqs.png 1400w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Step 5: Visualize Traces in AWS X-Ray<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Once everything is set up, you can visualize your traces in <strong>AWS X-Ray<\/strong>:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Open the <strong>AWS X-Ray Console<\/strong> from the AWS Management Console.<\/li>\n\n\n\n<li>Navigate to the <strong>Service Map<\/strong> to view the interactions between API Gateway, Lambda, SQS, OpenSearch, and Bedrock.<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">In this chatbot use case, you can see that a client interacts with the chatbot interface through an API call, which invokes a Lambda function that calls a Bedrock Agent to generate responses. 
You can see in the map below that there is a linked trace after the SQS queue that shows my lambda function (sqs-consumer) consumes the message. This propagation is done automatically by SQS without the need to inject any trace header.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"397\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2024\/09\/map-2-1024x397.png\" alt=\"\" class=\"wp-image-915\" style=\"width:1024px;height:auto\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/map-2-1024x397.png 1024w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/map-2-300x116.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/map-2-768x298.png 768w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/map-2-1536x595.png 1536w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/map-2.png 1788w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"685\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2024\/09\/all-trace-1024x685.png\" alt=\"\" class=\"wp-image-916\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/all-trace-1024x685.png 1024w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/all-trace-300x201.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/all-trace-768x514.png 768w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/all-trace.png 1222w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Click on individual traces to drill down into the specific operations, latency, and any errors that occurred through the point at which the chatbot API is called all the way to the end of the request. 
We can see a waterfall style diagram of all the different elements involved during the call and how long it took for each of them to call and retrieve a result. <\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"661\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2024\/09\/trace1-1024x661.png\" alt=\"\" class=\"wp-image-917\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/trace1-1024x661.png 1024w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/trace1-300x194.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/trace1-768x496.png 768w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/trace1.png 1227w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"380\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2024\/09\/trace2-1024x380.png\" alt=\"\" class=\"wp-image-918\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/trace2-1024x380.png 1024w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/trace2-300x111.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/trace2-768x285.png 768w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/trace2.png 1260w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Step 6: Debugging with Tracing<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">When a system issue arises, you can use <strong>AWS X-Ray<\/strong> to track the problem through the entire system by examining each trace segment for anomalies.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">1. <strong>Simulate a Failure Scenario<\/strong><\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Let\u2019s simulate a failure by deliberately breaking the Lambda function that queries <strong>OpenSearch<\/strong>. 
I modify my chatbot lambda function <strong><code>query_opensearch<\/code><\/strong>() to introduce an error:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"python\" class=\"language-python\"><code>def query_opensearch(query_text):\n    # Simulate a failure by introducing an invalid operation\n    raise Exception(\"Simulated failure in OpenSearch query\")\n<\/code><\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This deliberate failure causes the function to break whenever it attempts to query OpenSearch.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">2. <strong>View the Trace in AWS X-Ray<\/strong><\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Now that I\u2019ve broken the function, invoke the Lambda function by sending a query through my chatbot. Then, follow these steps to trace the failure in <strong>AWS X-Ray<\/strong>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Step 1<\/strong>: Go to the <strong>AWS X-Ray Console<\/strong> and open the <strong>Service Map<\/strong>.<\/li>\n\n\n\n<li><strong>Step 2<\/strong>: Locate the <strong>chatbot<\/strong> node that shows in red.<\/li>\n\n\n\n<li><strong>Step 3<\/strong>: Click on the <strong>chatbot<\/strong> node to view detailed trace data, including the failure point and the error messages.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">In the <strong>Trace Map<\/strong>, you\u2019ll see a visual representation of how the request flowed from <strong>API Gateway<\/strong>, through <strong>Lambda<\/strong>, to <strong>SQS<\/strong>, and to <strong>OpenSearch<\/strong>. Each segment will be color-coded to indicate success, failure, or latency issues. 
The chatbot function is now shown in red, indicating that this node is causing the error.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"758\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2024\/09\/trace-500-1024x758.png\" alt=\"\" class=\"wp-image-924\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/trace-500-1024x758.png 1024w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/trace-500-300x222.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/trace-500-768x569.png 768w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/trace-500.png 1067w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">3. <strong>Trace Logs and Debugging<\/strong><\/h4>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>X-Ray Trace Detail<\/strong>: Within the trace detail, you&#8217;ll see the exact point where the error occurred, allowing you to pinpoint which service or code segment caused the issue. In our case, the chatbot function returns a 500 fault, and when we examine the segment in detail, the exceptions show the error message <code>\"<strong>Simulated failure in OpenSearch query<\/strong>\"<\/code>. 
This shows how X-Ray lets you move from a failing node on the map to the exact exception behind it.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"393\" src=\"http:\/\/184.72.63.26\/wp-content\/uploads\/2024\/09\/segment-details-1024x393.png\" alt=\"\" class=\"wp-image-925\" srcset=\"https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/segment-details-1024x393.png 1024w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/segment-details-300x115.png 300w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/segment-details-768x295.png 768w, https:\/\/www.wallacel.com\/wp-content\/uploads\/2024\/09\/segment-details.png 1389w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Limitations of AWS X-Ray<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">While AWS X-Ray supports many services, including <strong>Lambda<\/strong>, <strong>API Gateway<\/strong>, <strong>SQS<\/strong>, <strong>SNS<\/strong>, and <strong>S3<\/strong>, some services, such as <strong>Kinesis<\/strong> and <strong>DynamoDB Streams<\/strong>, still require manual instrumentation for full traceability. For services without native X-Ray support, you can manually instrument the service and propagate trace headers across service calls to keep visibility consistent across your architecture.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For unsupported services, like DynamoDB Streams or custom HTTP requests, you can use ADOT to create and propagate trace context manually. 
Below is an example of using OpenTelemetry with AWS X-Ray for tracing a DynamoDB operation:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"python\" class=\"language-python\">import boto3\nfrom opentelemetry import trace\nfrom opentelemetry.instrumentation.botocore import BotocoreInstrumentor\n\n# Get a tracer from the globally configured tracer provider\n# (assumed to be set up by ADOT)\ntracer = trace.get_tracer(__name__)\n\n# Instrument botocore (used by boto3) so AWS SDK calls produce spans\nBotocoreInstrumentor().instrument()\n\n# Wrap the DynamoDB request in a manually created span\nwith tracer.start_as_current_span(\"DynamoDB Operation\"):\n    dynamodb_client = boto3.client('dynamodb')\n    dynamodb_client.put_item(TableName=\"MyTable\", Item={\"Key\": {\"S\": \"value\"}})<\/code><\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">Another Limitation of AWS X-Ray &#8211; Sampling Rate<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">AWS X-Ray uses <strong>sampling<\/strong> to limit the amount of data collected and processed, which reduces cost and performance overhead. That means it might not capture every request, particularly in high-traffic environments. To mitigate this limitation, you can use the following strategies:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Create Custom Sampling Rules<\/strong>: AWS X-Ray allows you to define custom sampling rules to control how often traces are collected based on specific criteria, such as the request path, service name, or resource ARN. 
This gives you fine-grained control over which requests to sample more frequently.<\/p>\n\n\n\n<ul id=\"block-956f545e-ea80-4b0e-85e2-2e1aa7a311e3\" class=\"wp-block-list\">\n<li><strong>Increase Sampling for Critical Requests<\/strong>: You can configure a higher sampling rate for critical services or API paths (e.g., <code>\/payment<\/code>, <code>\/checkout<\/code>, or API Gateway stages) while reducing the sampling rate for less important or high-traffic requests.<\/li>\n\n\n\n<li><strong>Reduce Sampling for Routine Operations<\/strong>: Set lower sampling rates for common, non-critical requests, like health checks or static asset requests, to reduce unnecessary traces.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">This keeps the critical parts of your system well covered by traces, while non-essential requests are sampled less often.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Conclusion<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">In this post, I\u2019ve explored one of the three pillars of observability, <strong>traces<\/strong>, and demonstrated how to set up <strong>AWS X-Ray<\/strong> and <strong>AWS Distro for OpenTelemetry (ADOT)<\/strong> for tracing in a distributed application. By following these steps, you can gain deep insights into the performance and behaviour of your distributed services, quickly identify and troubleshoot issues, and improve their reliability.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Thank you for reading my blog and I hope you like it!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Observability plays a crucial role in understanding, maintaining, and optimizing complex systems in the realm of Site Reliability Engineering (SRE). It enables SREs to gain deep insights into the internal workings of systems by analyzing outputs such as metrics, logs, and traces. 
When it comes to distributed applications, it&#8217;s essential to ensure that you can [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":944,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[55,61,52,60,57,56,50,54,59,58,53],"class_list":["post-885","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-aws","tag-adot","tag-api-gateway","tag-cloudwatch","tag-lambda","tag-logs","tag-metrics","tag-opensearch","tag-opentelemetry","tag-sqs","tag-trace","tag-x-ray"],"_links":{"self":[{"href":"https:\/\/www.wallacel.com\/index.php\/wp-json\/wp\/v2\/posts\/885","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.wallacel.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.wallacel.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.wallacel.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.wallacel.com\/index.php\/wp-json\/wp\/v2\/comments?post=885"}],"version-history":[{"count":26,"href":"https:\/\/www.wallacel.com\/index.php\/wp-json\/wp\/v2\/posts\/885\/revisions"}],"predecessor-version":[{"id":1129,"href":"https:\/\/www.wallacel.com\/index.php\/wp-json\/wp\/v2\/posts\/885\/revisions\/1129"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.wallacel.com\/index.php\/wp-json\/wp\/v2\/media\/944"}],"wp:attachment":[{"href":"https:\/\/www.wallacel.com\/index.php\/wp-json\/wp\/v2\/media?parent=885"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.wallacel.com\/index.php\/wp-json\/wp\/v2\/categories?post=885"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.wallacel.com\/index.php\/wp-json\/wp\/v2\/tags?post=885"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}