DZone
Monitoring and Observability

Modern systems span numerous architectures and technologies and are becoming exponentially more modular, dynamic, and distributed in nature. These complexities also pose new challenges for developers and SRE teams that are charged with ensuring the availability, reliability, and successful performance of their systems and infrastructure. Here, you will find resources about the tools, skills, and practices to implement for a strategic, holistic approach to system-wide observability and application monitoring.

Latest Premium Content
Trend Report
Observability and Performance
Refcard #290
Getting Started With Log Management
Refcard #368
Getting Started With OpenTelemetry

DZone's Featured Monitoring and Observability Resources

Observability for Agents and Workflows: Tracing Prompts, Tool Calls, and Business Outcomes End-to-End

By Srinivas Chippagiri, DZone Core
AI agents have come a long way. They no longer just answer simple questions; they handle order checks, summarize support tickets, update records, route incidents, approve requests, and even call internal tools. As these agents slip deeper into real business workflows, just peeking at model logs isn’t enough. Teams need to see everything: what the agent did, why it did it, which systems it poked, and whether the end result actually helped the business.

Agent Observability

That’s where agent observability comes in. Traditional observability lets teams watch over their apps, APIs, databases, and infrastructure. Agent observability goes a step further. It shines a light on the whole AI workflow: it connects the dots from the user’s request to the agent’s decisions, the tools it touches, the systems it interacts with, and all the way to the final outcome.

Let’s see a customer support example. Say a customer messages, “My subscription renewal failed, but I got charged twice.” A human rep checks the account, payment history, billing rules, refund policy, and ticket history before answering. Now, an AI agent might do that job automatically. It’ll spot the billing problem, look up the customer record, call the billing system, check for duplicate payments, and either resolve the issue or escalate it if things get too messy.

On the surface, this whole thing just looks like a simple chat. However, under the hood, it’s a full-on workflow. If you want good observability, you need that behind-the-scenes view.

Why bother? Because the final response doesn’t tell you the whole story. If the customer comes back unhappy, you need to nail down whether the agent checked the right account, used the right billing tool, hit an error, misread the request, or escalated when it couldn’t help.

Don’t Just Watch the Answer: Follow the Whole Journey

When you break down agent interactions, a few basic layers show the full picture. First, track the user request.
What did the user ask? Was it urgent, fuzzy, sensitive, or bound to a customer contract? Second, watch the agent’s action. Did it answer straight away, ask a follow-up question, search a knowledge base, use a tool, or hand off to a human? Third, note the context. What sort of information did it use? Did it pull a help article, customer details, invoice, ticket, policy, or product data? Fourth, log tool usage. Did the agent call billing APIs, CRM systems, databases, incident tools, or an approval workflow? Did those calls work, or did they fail? Lastly, look at the result. Did the agent fix the customer’s problem? Was the ticket reopened? Did a human have to clean up after the agent?

Without these layers, you’ll know when something was slow or incorrect, but not why. Maybe the context was off, a tool call failed, the agent lacked permissions, the prompt changed, or something further downstream broke.

Use a Single ID to Track Everything

One of the easiest fixes is to tag the whole workflow with a tracking ID. Let that ID travel with the request, from the interface through the agent, tools, APIs, and your business systems. Now, if a support ticket gets botched, the team can retrace every step: what the customer asked, what the agent understood, which account it checked, what the billing system said back, and why the agent chose to close or escalate.

It’s not just for support. Maybe your SRE team uses an AI agent to help dig into a production alert. The agent scans logs, checks recent deployments, reviews database metrics, and suggests the likely cause. That same tracking ID means you’ll know exactly which systems the agent checked and whether it missed anything crucial.

Don’t Ignore Tool Calls; They’re Real Actions

Here’s where things get serious. When an agent calls a tool, it’s taking action. Looking up customers, updating records, approving requests, creating tickets, and kicking off workflows are all actions that need to be watched closely.
For each tool call, capture details like tool name, how long it took, success or failure, retries, permission results, error messages, and what actually happened.

Take a finance workflow. Say the agent reviews vendor invoices by extracting details, matching with a purchase order, checking taxes, and routing exceptions to finance. If an invoice gets approved by mistake, did the agent misread the invoice? Match it with the wrong purchase order? Miss a policy update? Or did the finance system return incomplete info? That’s why tracking tool calls is critical. A wrong answer in chat is one thing, but a wrong move in your business system can lead to real trouble: money lost, operations disrupted, and even compliance issues.

Understand Agent Decisions, But Protect Privacy

Teams need to understand what the agent did, but you don’t want to log every single “thought” it had; that is just unnecessary noise. Instead, record decision details in a structured way. Example:

  • Intent: billing dispute
  • Confidence: medium
  • Tool: billing lookup
  • Reason: account verification needed
  • Policy result: escalate
  • Final action: handoff to human

Now you have enough to debug the workflow and report on it, without exposing raw thought streams. You can spot how often agents escalate from low confidence, where tools fail, or if policy rules stop an action.

Connect Observability to Business Outcomes

Don’t just track the tech stuff; what really matters is whether the agent gets the job done. Watch business metrics like:

  • Resolution time
  • Escalation rate
  • Workflow completion rate
  • Tool failures
  • Cost per workflow
  • SLA hits or misses
  • Rework
  • How often humans step in

If you’ve got an e-commerce agent helping buyers pick products, check inventory, apply discounts, and guide checkout, you want to know: did the customer actually buy the item? If checkout drops after you tweak a prompt, find out why. Did the agent push out-of-stock items? Apply discounts wrong? Use the wrong tool? Lose customers with confusing answers?
Observability at this level helps both engineering and business teams get answers, fast.

Build Dashboards for Different Audiences

Everyone’s got different needs. SREs care about latency, failed tools, retries, issues with dependencies, and cost spikes. Security teams focus on policy denials, suspicious tool actions, sensitive data flags, or prompt injection attempts. Product owners want completion rates, escalations, customer satisfaction, and abandoned workflows. Engineers need to see how agent behavior shifts after you change the model, prompt, workflow, or deployment. Business folks need throughput, SLAs, cost savings, and improvements to customer experience.

Take security operations. Say an agent checks suspicious logins, identity logs, privilege changes, and endpoint activity. Security needs to know: did the agent just review info, or did it try to lock an account? If it got blocked, you want that visible, too.

Alert on AI-Specific Failures

AI agents fail in new ways. Teams need alerts for things like sudden spikes in tool denials, fallback responses, unexpected tool usage, cost blowups, prompt injection attempts, completion drops, or escalating cases.

If an agent suddenly goes wild with refund actions, it could mean a prompt is off, a policy is weak, or something’s getting abused. If fallback responses shoot up, maybe the knowledge base is broken. Costs spike? Maybe the agent is stuck looping, retrying, or making unnecessary expensive calls.

Tie alerts to deployments, too. Agents change behavior after you update a prompt, switch models, change schema, adjust policies, or edit a workflow. Teams should compare how the agent behaved before and after.

A Simple Way to Grow Observability

Observability matures in steps.
  • Basic logs: prompts, responses, errors, timestamps
  • Tool visibility: what got used, if it worked, how long it took
  • End-to-end traces: follow the user request through the agent, tools, APIs, systems
  • Business-level result tracking: resolution, escalation, completion, rework, cost, SLA
  • Automated alerts: regressions after updates, anomalies, unusual patterns

Observability is mostly about visibility into the whole workflow and making sense of it. Teams need to know what users wanted, what the agent decided, which info it used, which tools it grabbed, which systems it touched, and whether business value was delivered.

As AI agents settle into production, observability has to cover more than just servers and app logs. The teams that win will be the ones who trace agent behavior end to end, spot failures early, explain what happened, and keep improving safely.
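To make the single tracking ID and the per-tool-call details from this article concrete, here is a minimal Python sketch. It is an illustration only: `WorkflowTrace`, `ToolCallRecord`, and `lookup_billing` are hypothetical names, not part of any product or library, and the retry policy is an assumption chosen to keep the example short.

```python
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class ToolCallRecord:
    """One tool invocation, tagged with the workflow-wide tracking ID."""
    workflow_id: str
    tool: str
    success: bool
    duration_ms: float
    retries: int
    error: Optional[str] = None


@dataclass
class WorkflowTrace:
    """Carries a single tracking ID and collects one record per tool call."""
    workflow_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    calls: List[ToolCallRecord] = field(default_factory=list)

    def call_tool(self, tool: str, fn: Callable[..., Any], *args: Any,
                  max_retries: int = 1, **kwargs: Any) -> Any:
        """Run fn, retry up to max_retries times, and record the outcome."""
        start = time.perf_counter()
        retries = 0
        result, error = None, None
        while True:
            try:
                result = fn(*args, **kwargs)
                error = None
                break
            except Exception as exc:  # keep the failure visible in the record
                error = str(exc)
                if retries >= max_retries:
                    break
                retries += 1
        self.calls.append(ToolCallRecord(
            workflow_id=self.workflow_id,
            tool=tool,
            success=error is None,
            duration_ms=(time.perf_counter() - start) * 1000,
            retries=retries,
            error=error,
        ))
        return result


def lookup_billing(customer_id: str) -> dict:
    """Hypothetical stand-in for a real billing API call."""
    return {"customer_id": customer_id, "duplicate_charge": True}


trace = WorkflowTrace()
billing = trace.call_tool("billing_lookup", lookup_billing, "cust-42")
for record in trace.calls:
    print(asdict(record))  # every record shares trace.workflow_id
```

Because every record carries the same `workflow_id`, a botched ticket can be retraced by filtering on that one value, whichever tool or system produced the record.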
Build a GitHub Slack Bot With AWS Bedrock and MCP, Part 2

By Sangharsh Agarwal
What This Series Is About

This is Part 2 of a two-part series on building a Slack bot that answers natural language questions about a GitHub repository using AWS Bedrock (Claude) and GitHub's official Model Context Protocol (MCP) server.

Part 1 covered the why: most AI tools suggest wrapping GitHub's REST API and feeding the response to a model. That approach works, but it produces brittle glue code that grows with every new question type and every new data source. MCP offers a fundamentally better pattern: a tool registry that the model queries at runtime, making routing decisions autonomously. The result is a 150-line bot that answers questions you never anticipated and extends to new data sources with four lines of configuration.

If you have not read Part 1, it is available here: https://dzone.com/articles/build-a-github-slack-bot-with-aws-bedrock-and-mcp. The full project code is on GitHub: https://github.com/sangharshcs/slack-github-mcp-bot.

This article covers the implementation: the four key architectural pieces, how to get it running, how to extend it to new MCP servers, and the production lessons from running it on a real engineering team.

How It Is Built: The Four Key Pieces

The bot has four distinct components. Understanding each one separately makes the whole system easier to reason about and extend.

1. The MCP Request Function

All communication with GitHub's MCP server goes through a single function. GitHub MCP returns Server-Sent Events (SSE) rather than plain JSON, so the function handles both response types transparently. It also checks HTTP status and surfaces MCP-level errors cleanly; without this, a 401 or 500 from the server fails silently.

The function signature accepts the endpoint and headers as parameters, not hardcoded values. This is the detail that makes the whole system extensible: the same function routes to GitHub today and to any other MCP server tomorrow.

2. The Tool Registry

At startup, the bot calls tools/list on every connected MCP server and records which server owns each tool. This registry, a simple JavaScript object mapping tool name to endpoint and auth headers, is the entire routing mechanism. When Claude calls a tool, the bot looks up its origin and sends the request there.

Adding a new MCP server means calling the same loadServer() function with the new URL and credentials. The registry grows. The agent loop never changes. This four-line pattern is the extensibility mechanism Eric described as worth expanding on:

```javascript
// Same pattern for every MCP server you add:
const myServiceHeaders = {
  Authorization: `Bearer ${process.env.MY_SERVICE_TOKEN}`,
  'Content-Type': 'application/json',
  Accept: 'application/json, text/event-stream',
};
await loadServer(process.env.MY_SERVICE_MCP_URL, myServiceHeaders);
// Then add routing guidance to your system prompt.
// The agent loop below does not change.
```

3. The Agent Loop

The loop sends the question to Claude with the full tool list. Claude selects tools, the bot executes them via the registry, results return to Claude, and the cycle repeats until Claude has enough to answer, typically after 3 to 8 tool calls. The loop is generic: it does not know whether it is answering a bug or a PR question. The system prompt configures the behavior. The same code handles every question type, present and future.

4. The System Prompt

The system prompt is the highest-leverage piece in the entire system. The difference between a bot your team uses daily and one they quietly stop using is almost always prompt quality, not code quality. Three rules matter most:

  • Explicit Slack markdown syntax. Claude defaults to standard Markdown. Without being told otherwise, it uses **bold** and [text](url), which Slack renders as raw asterisks and broken links. The prompt must specify *bold*, <url|text>, no # headings, no markdown tables.
  • High-volume handling. Without a rule, asking 'list all open issues' on a large repo returns an unusable wall of text. The prompt should specify: if results exceed 15 items, summarise by category and show the top 10.
  • Tool routing for multiple servers. When you add a second MCP server, the prompt tells Claude which questions map to which server. This reduces unnecessary tool calls and keeps responses fast.

The complete index.js, package.json, and .env template are in the project repository at https://github.com/sangharshcs/slack-github-mcp-bot.

Getting It Running

Setup involves three external services (Slack, GitHub, and AWS Bedrock), each requiring a token. Rather than reproducing the full step-by-step here (that lives in the project README at https://github.com/sangharshcs/slack-github-mcp-bot), here is what each token is and where to get it.

  • The Slack bot token (xoxb-...) comes from creating a Slack app at api.slack.com/apps with Socket Mode enabled. Socket Mode is what lets the bot run from your laptop without a public URL: it connects outbound over WebSocket. You also need an App-Level Token (xapp-...) for the socket connection itself, and a Signing Secret from Basic Information. The bot needs these scopes: app_mentions:read, chat:write, im:history, im:write.
  • The GitHub token is a fine-grained personal access token from github.com/settings/tokens. Select your target repository and grant read access to Issues, Pull Requests, Contents, and Metadata. No write access is needed.
  • The Bedrock API key comes from the AWS console under Amazon Bedrock → API keys. Enable the Claude Sonnet 4.6 model under Model access first. One detail that catches everyone: Claude 4.x models require a cross-region inference profile prefix. Use us.anthropic.claude-sonnet-4-6, not anthropic.claude-sonnet-4-6. The bare ID returns "on-demand throughput isn't supported".

With credentials in .env, npm install and node index.js is all it takes.
The bot logs the number of GitHub tools loaded and is ready to receive mentions.

Extending to Other MCP Servers

loadServer() is the entire extension mechanism. Call it with any MCP-compatible service. The registry grows, Claude discovers the new tools, and you add one line to the system prompt describing when to use them.

| MCP Server | URL | What it adds |
| --- | --- | --- |
| Linear | mcp.linear.app/mcp | Issues, projects, cycles, roadmaps |
| Cloudflare | mcp.cloudflare.com/mcp | Workers, analytics, DNS, R2 storage |
| Stripe | mcp.stripe.com/mcp | Payments, customers, subscriptions |
| Custom | @modelcontextprotocol/sdk | Any internal REST API as MCP tools |

What We Ran Into in Production

We have been running this bot on a busy engineering repository for several months. Before sharing the limitations we documented, it is worth saying that none of them were showstoppers, but they were real, and ignoring them would leave you unprepared.

The biggest adjustment was latency. Complex queries that trigger 6 to 8 tool calls take 15 to 30 seconds. We handled this with the thinking-indicator pattern (post a placeholder message immediately, then update it when the answer is ready), which kept the experience feeling responsive even when the underlying calls were slow.

Debugging took more work than expected. When a traditional API client gives a wrong answer, the fix is obvious. When an agentic loop gives a wrong answer, you need to know which tools Claude chose, what they returned, and how Claude reasoned over the results. We solved this by logging every tool call (name, input, result, timestamp) and shipping those logs to our observability platform. That log became our primary debugging tool for agent behavior.

Prompt quality turned out to be load-bearing in a way we did not fully anticipate. Early versions of the bot would return raw asterisks in Slack, generate unusable walls of text for large result sets, and occasionally pick the wrong tool. Each of these was a prompt fix, not a code fix.
Investing time in the system prompt before going live would have saved us several rounds of iteration.

Cost is worth monitoring at scale. A query triggering 8 Bedrock calls costs meaningfully more than a single response. For an internal team tool used dozens of times a day, the cost is negligible. At higher volume, it warrants attention.

The productivity gain from not maintaining API clients has outweighed all of these constraints at the scale we operate. The right framing is not "is MCP perfect?" but "is MCP better than the alternative?" For our team, the answer has consistently been yes.

Conclusion

The architecture across these two articles is intentionally small. A tool registry, a generic agent loop, and a system prompt that configures behavior: that is all of it. The 150 lines in the repository are a starting point your team can run today and grow as your toolchain does.

Start with GitHub MCP. Get it answering questions in Slack. Test it with your team. Then look at what they ask most often and which data sources those questions touch. The next MCP server to register will be obvious. The code to add it is four lines.

If you landed on Part 2 first and want to understand the architectural reasoning (why MCP is a fundamentally better pattern than the conventional REST API wrapper approach, and why it matters especially for SRE and platform teams), Part 1 covers all of that and is the recommended starting point.

Part 1: Why MCP Changes Everything About AI Tool Integration. Full project code: https://github.com/sangharshcs/slack-github-mcp-bot.
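The tool registry at the center of this design can be sketched in a few lines. The article's actual implementation is JavaScript and lives in the project repository; what follows is only a Python illustration of the idea, under stated assumptions: `ToolRegistry`, `ServerInfo`, `fake_list_tools`, the tool names, and the `example.invalid` endpoints are all hypothetical, and the `list_tools` callable stands in for a real MCP tools/list request.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ServerInfo:
    """Where a tool lives: the MCP endpoint and the headers to reach it."""
    endpoint: str
    headers: Dict[str, str]


# Stand-in for the MCP tools/list request: (endpoint, headers) -> tool names.
ListTools = Callable[[str, Dict[str, str]], List[str]]


class ToolRegistry:
    """Maps each tool name to the server that owns it."""

    def __init__(self) -> None:
        self._owners: Dict[str, ServerInfo] = {}

    def load_server(self, endpoint: str, headers: Dict[str, str],
                    list_tools: ListTools) -> int:
        """Register every tool a server advertises; return how many."""
        server = ServerInfo(endpoint, headers)
        names = list_tools(endpoint, headers)
        for name in names:
            self._owners[name] = server
        return len(names)

    def route(self, tool_name: str) -> ServerInfo:
        """Look up which server should receive a call to tool_name."""
        if tool_name not in self._owners:
            raise LookupError(f"no MCP server registered for tool {tool_name!r}")
        return self._owners[tool_name]


def fake_list_tools(endpoint: str, headers: Dict[str, str]) -> List[str]:
    """Hypothetical catalog used instead of a network call."""
    catalog = {
        "https://github.example.invalid/mcp": ["list_issues", "get_pull_request"],
        "https://tracker.example.invalid/mcp": ["list_projects"],
    }
    return catalog[endpoint]


registry = ToolRegistry()
registry.load_server("https://github.example.invalid/mcp",
                     {"Authorization": "Bearer <github-token>"}, fake_list_tools)
registry.load_server("https://tracker.example.invalid/mcp",
                     {"Authorization": "Bearer <tracker-token>"}, fake_list_tools)
print(registry.route("list_projects").endpoint)
```

Adding a server only adds entries to the mapping; the lookup in `route` stays the same, which is why the agent loop does not need to change.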
Compliance Automated Standard Solution (COMPASS), Part 11: Compliance as Code, the OSCAL MCP Server Way
By Yuji Watanabe
Build a GitHub Slack Bot With AWS Bedrock and MCP, Part 1
By Sangharsh Agarwal
Stop Debugging Glue Jobs Manually: Building an Agentic Observability Layer for Data Pipelines
By Vivek Venkatesan
Implementing Observability in Distributed Systems Using OpenTelemetry

Modern distributed systems demand observability, the ability to understand internal states from external outputs. Observability is achieved by collecting traces, logs, and metrics to improve performance, reliability, and availability. No single signal is sufficient; it's the combination and correlation of these data that form a narrative for root cause analysis.

In monolithic applications, debugging was easier since one service handled a request. In contrast, microservices distribute a request across many services, making it hard to follow a transaction’s path. OpenTelemetry (OTel) distributed tracing shines here; it propagates context with each request, so you can trace a transaction across service boundaries. This means when Service A calls Service B, they share a common trace ID, allowing you to view a single trace spanning multiple services. Similarly, OpenTelemetry can attach unique identifiers to logs, making it easier to correlate log events across services.

Overall, OTel provides a unified API for instrumenting code and an ecosystem of instrumentation libraries for frameworks that can automatically capture common operations. It focuses on data generation and collection, while the actual storage and querying of telemetry is handled by backend tools.

Setting Up OpenTelemetry in a Python Microservice

Installation

To get started, install the OpenTelemetry libraries for Python. At minimum, you'll need the API and SDK, plus exporters/instrumentation for your use case. For example:

```shell
pip install opentelemetry-api opentelemetry-sdk \
    opentelemetry-exporter-prometheus \
    opentelemetry-instrumentation-flask
```

This installs the core OTel API/SDK, the Prometheus metrics exporter, and the Flask instrumentation. You might also install the OTel OTLP exporter, which is a generic exporter that can send data to an OpenTelemetry Collector or other backend via the OTLP protocol.
Additionally, it's recommended to set a service name for your application so that telemetry from this service is identifiable. This can be done via code or an environment variable. In code, you'll see below how we attach a service name as a resource attribute so that traces and metrics are tagged with service.name.

Distributed Tracing With OpenTelemetry

Tracing involves capturing spans that represent units of work in the system. In a microservice, a span could represent an incoming HTTP request, a database query, or an external API call. Spans form a trace when linked together via context propagation. Using OpenTelemetry, we can instrument our Python service to create spans for critical operations and automatically propagate the trace context to downstream services.

First, let's initialize OpenTelemetry tracing in our Python microservice. We create a tracer provider, configure an exporter, and obtain a tracer instance:

```python
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, BatchSpanProcessor
from opentelemetry.sdk.resources import Resource, SERVICE_NAME

# Set up tracer provider with service name for identification
trace.set_tracer_provider(
    TracerProvider(resource=Resource.create({SERVICE_NAME: "order-service"}))
)
tracer = trace.get_tracer(__name__)

# Configure a span processor with a Console exporter (prints trace data to stdout)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)

# Example: instrument a code block with a span
with tracer.start_as_current_span("process_order"):
    # Simulate processing (e.g., calling another service or performing work)
    time.sleep(0.1)
    # If this code calls another service through an instrumented HTTP client,
    # OpenTelemetry propagates the trace context via HTTP headers
```

In this snippet, we configured a TracerProvider with a resource attribute service.name="order-service" so that all spans from this service are labeled.
We added a BatchSpanProcessor with a ConsoleSpanExporter; this will batch our spans and print them to the console in JSON for demonstration. In a real system, you might use a Jaeger exporter here to send spans to a Jaeger agent. The tracer = trace.get_tracer(__name__) call gives us a tracer we can use to start spans. We then start a span named "process_order" using a context manager (start_as_current_span), which automatically ends the span when the block exits. Inside that span, you would put the operation you want to measure.

Metrics Collection and Export (Prometheus Integration)

While tracing shows the path of individual requests, metrics provide aggregated insights into system behavior. OpenTelemetry’s metrics API allows you to define instruments like counters and histograms to record these values.

First, ensure the Prometheus client/exporter is set up. We’ll use OTel’s Prometheus exporter, which works by exposing a /metrics HTTP endpoint that Prometheus will scrape. In code, this is done by creating a PrometheusMetricReader and starting an HTTP server for metrics.
Here’s how you can integrate metrics in a Flask microservice:

```python
import time

from flask import Flask, request, g
from prometheus_client import start_http_server
from opentelemetry import metrics
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.resources import Resource, SERVICE_NAME
from opentelemetry.instrumentation.flask import FlaskInstrumentor

# Initialize metrics provider with Prometheus exporter (reader)
resource = Resource(attributes={SERVICE_NAME: "order-service"})
reader = PrometheusMetricReader()  # exposes metrics in Prometheus format
provider = MeterProvider(resource=resource, metric_readers=[reader])
metrics.set_meter_provider(provider)
meter = metrics.get_meter(__name__)

# Define metric instruments
request_counter = meter.create_counter(
    name="app_requests_total",
    description="Total number of requests processed",
    unit="1",
)
request_latency = meter.create_histogram(
    name="app_request_latency_ms",
    description="Request latency in milliseconds",
    unit="ms",
)

# Start Prometheus client on an endpoint (e.g., port 8000) for scraping
start_http_server(port=8000, addr="0.0.0.0")

# Flask app and instrumentation
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)  # auto-instrument Flask for tracing


@app.before_request
def before_request():
    g.start_time = time.time()
    # Increment counter for each incoming request, with the route path as a label
    request_counter.add(1, {"endpoint": request.path})


@app.after_request
def after_request(response):
    # Record the request duration in milliseconds
    duration_ms = (time.time() - g.start_time) * 1000
    request_latency.record(duration_ms, {"endpoint": request.path})
    return response


# Example route
@app.route("/hello")
def hello():
    return "Hello, World!", 200
```

In the setup above, we configured a MeterProvider with a PrometheusMetricReader. This essentially registers an HTTP endpoint that exposes our metrics in Prometheus format.
We explicitly call start_http_server(port=8000) to start the metrics server on port 8000; Prometheus will scrape this endpoint. We created two metric instruments: a counter to count the number of requests, and a histogram to track the distribution of request durations. In the Flask hooks, we use these instruments: at the beginning of each request, we note the start time and increment the counter. After the request is handled, we compute the elapsed time and record it in the histogram, again labeled by the endpoint path. These labels let us break down metrics per route.

Log Correlation With OpenTelemetry

Logs are the third pillar of observability. They provide detailed event information and error messages. OpenTelemetry can augment logging by injecting trace context into logs, so that you know which trace/span a log entry is associated with. In Python, the package opentelemetry-instrumentation-logging can automatically enrich Python logging records with trace context. After installing it, you can enable it with:

```python
from opentelemetry.instrumentation.logging import LoggingInstrumentor

LoggingInstrumentor().instrument(set_logging_format=True)
```

This will ensure that whenever you call the standard logging functions, if a trace is currently active, the log record will contain the trace and span IDs. For instance, you might see logs like:

```text
INFO [trace_id=0xf4a3b...] Order 123 processed successfully
```

This indicates that the log was emitted during a specific trace. To fully centralize logs, you would forward them to a log backend. One approach is using the OpenTelemetry Collector to collect and export logs.

Conclusion

Implementing observability in a microservice architecture is no small feat, but OpenTelemetry greatly simplifies the process by providing a one-stop solution for instrumentation.
We have shown how to set up distributed tracing to follow requests across services, how to collect metrics and export them to Prometheus for monitoring, and how to correlate logs with trace context. With these in place, you gain deep visibility into your system. You can monitor performance and identify latency bottlenecks, get alerted on anomalies via metrics, trace requests end-to-end to see where failures occur, and dive into logs for detailed errors. This comprehensive observability is crucial for engineers to effectively maintain and optimize distributed systems. In summary, OpenTelemetry enables a consistent, portable way to implement observability across distributed systems. Embracing it in your microservices will lead to faster debugging, better performance insights, and more resilient applications. With traces, metrics, and logs at your fingertips, you are no longer flying blind in your distributed architecture; instead, you have the data to understand and improve your system continually.
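The article states that Service A and Service B share a common trace ID but does not show what actually crosses the wire. The sketch below illustrates that idea with the W3C Trace Context `traceparent` header format (version, 32-hex trace ID, 16-hex parent span ID, flags), using only the standard library. It is a conceptual illustration, not OpenTelemetry's implementation: in a real service the OTel propagators (for example `inject` and `extract` from `opentelemetry.propagate`) do this for you, and the helper names here are hypothetical.

```python
import secrets
from typing import Dict, Tuple


def new_traceparent() -> str:
    """Start a new trace: fresh 16-byte trace ID and 8-byte span ID."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"  # 01 = sampled flag


def parse_traceparent(header: str) -> Tuple[str, str, str]:
    """Split a traceparent header into (trace_id, parent_span_id, flags)."""
    version, trace_id, parent_id, flags = header.split("-")
    if version != "00" or len(trace_id) != 32 or len(parent_id) != 16:
        raise ValueError(f"malformed traceparent: {header!r}")
    return trace_id, parent_id, flags


def child_traceparent(incoming: str) -> str:
    """Continue the caller's trace: same trace ID, new span ID."""
    trace_id, _parent_id, flags = parse_traceparent(incoming)
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


# Service A starts the trace and sends the header with its outgoing request.
outgoing_headers: Dict[str, str] = {"traceparent": new_traceparent()}

# Service B reads the header and creates its own span inside the same trace.
service_b_header = child_traceparent(outgoing_headers["traceparent"])

trace_a, span_a, _ = parse_traceparent(outgoing_headers["traceparent"])
trace_b, span_b, _ = parse_traceparent(service_b_header)
print(trace_a == trace_b, span_a != span_b)
```

The trace ID stays constant across the hop while each service contributes its own span ID, which is what lets a backend stitch the spans into one end-to-end trace.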

By Mugunth Chandran
Building a Zero-Cost Approval Workflow With AWS Lambda Durable Functions

When AWS announced Lambda Durable Functions at re:Invent 2025, my first reaction was, "Okay, but how is this different from Step Functions?" I have been building serverless workflows on AWS for a while now, and Step Functions has always been my go-to service for orchestrating multi-step pipelines. So naturally, I wanted to put this new capability to the test.

I decided to build a simple document processing workflow, an ETL pipeline with human-in-the-loop approval, using both Durable Functions and Step Functions, then run 1,000 actual document processing workflows through each system. What I found surprised me. Not just the cost difference (79% cheaper with Durable Functions), but the trade-offs that nobody is really talking about yet.

In this tutorial, I will walk you through building a zero-cost approval workflow using Lambda Durable Functions with Python. Along the way, I will share the actual cost numbers and the lessons that would have saved me a few hours of debugging.

The Problem: Approval Workflows Are Expensive

If you have ever built a document processing system that requires human approval, you know the pain. Someone uploads a file, your system processes it, and then... it sits there. Waiting for a human to review and approve it. That wait can be 5 minutes, 20 minutes, or even hours.

Traditional approaches to handling this waiting are:

  • Polling: Your code keeps checking every 30 seconds ("Is it approved yet? How about now?"), making those calls the entire time.
  • Always-on server: An EC2 instance or ECS container sits idle, costing you money 24/7, just to catch that one approval event.
  • External state management: You build a custom solution with DynamoDB, SQS, and Lambda triggers. It works fine, but it requires you to maintain a state machine you built yourself.

What if your workflow could just... pause? No compute charges. No polling. Just pause, wait for the human to do their thing, and resume exactly where it left off.
That is exactly what Lambda Durable Functions enables with the wait_for_callback pattern.

What We Are Building

Here is the workflow we will implement:

Extract data → Transform data → Load data → Wait for approval (≈20 min) → Finalize & archive

A CSV file gets uploaded to an S3 bucket under the uploads/ prefix. Our durable function picks it up, runs it through three ETL steps (extract, transform, load), then pauses execution and waits for a human to approve the processed data through a shared approval API. Once approved, the function resumes, finalizes the job, and archives the file. The key part? During that 20-minute (or 2-hour, or 2-day) approval wait, you pay absolutely nothing for compute.

Architecture Overview

The project uses three separate SAM stacks:

shared-resources/     # Approval API, DynamoDB, SNS (shared by both systems)
durable-functions/    # Lambda Durable Functions ETL pipeline
step-functions/       # Step Functions ETL pipeline (for comparison)

The shared approval handler serves both workflow types through a single API. When a job comes in for approval, it checks the workflowType field. If it is durable-functions, it calls send_durable_execution_callback_success. If it is step-functions, it calls send_task_success. Same API endpoint, different callback mechanisms under the hood.

Prerequisites

Before we begin, make sure you have the following:

• AWS SAM CLI (latest version recommended)
• Python 3.14 runtime
• AWS account with Lambda, DynamoDB, S3, SNS, and API Gateway access
• Docker for local Lambda testing

Check your SAM CLI version:

sam --version

Step 1: Deploy Shared Resources First

Before the ETL pipeline, we need the shared infrastructure: the approval API, the DynamoDB table for pending approvals, and the SNS topic for notifications.
Here is the shared-resources/ SAM template:

# shared-resources/template.yaml
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: Shared resources for ETL approval workflow

Parameters:
  ApproverEmail:
    Type: String
    Description: Email address to receive approval notifications
    Default: [email protected]

Resources:
  PendingApprovalsTable:
    Type: AWS::DynamoDB::Table
    Properties:
      TableName: etl-pending-approvals
      BillingMode: PAY_PER_REQUEST
      AttributeDefinitions:
        - AttributeName: jobId
          AttributeType: S
      KeySchema:
        - AttributeName: jobId
          KeyType: HASH
      TimeToLiveSpecification:
        AttributeName: ttl
        Enabled: true

  ApprovalNotificationTopic:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: etl-approval-notifications
      Subscription:
        - Endpoint: !Ref ApproverEmail
          Protocol: email

  ApprovalApi:
    Type: AWS::Serverless::Api
    Properties:
      Name: ETL-Approval-API
      StageName: prod

  ApprovalHandlerFunction:
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: ETL-Approval-Handler
      CodeUri: ./src
      Handler: approval_handler.handler
      Runtime: python3.14
      MemorySize: 256
      Timeout: 30
      Environment:
        Variables:
          APPROVALS_TABLE: !Ref PendingApprovalsTable
      Policies:
        - DynamoDBCrudPolicy:
            TableName: !Ref PendingApprovalsTable
        - Version: '2012-10-17'
          Statement:
            - Effect: Allow
              Action:
                - states:SendTaskSuccess
                - states:SendTaskFailure
                - lambda:SendDurableExecutionCallbackSuccess
                - lambda:SendDurableExecutionCallbackFailure
              Resource: '*'
      Events:
        ApproveJob:
          Type: Api
          Properties:
            RestApiId: !Ref ApprovalApi
            Path: /approve/{jobId}
            Method: POST
        RejectJob:
          Type: Api
          Properties:
            RestApiId: !Ref ApprovalApi
            Path: /reject/{jobId}
            Method: POST
        GetJobStatus:
          Type: Api
          Properties:
            RestApiId: !Ref ApprovalApi
            Path: /status/{jobId}
            Method: GET

Notice the approval handler has permissions for both states:SendTaskSuccess (Step Functions) and lambda:SendDurableExecutionCallbackSuccess (Durable Functions). This is the shared handler approach: one API that works with both workflow types.
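The template routes three paths (/approve/{jobId}, /reject/{jobId}, /status/{jobId}) to one function, so the handler has to work out which route was hit. The article does not show that part, so what follows is only a minimal sketch of how it might be done from an API Gateway REST proxy event. The helper name parse_approval_request and the sample job ID are hypothetical and are not taken from the repository.

```python
import json


def parse_approval_request(event):
    """Turn an API Gateway REST proxy event into (action, job_id, payload).

    Hypothetical helper for illustration; the real handler may differ.
    """
    # `resource` holds the route template, e.g. "/approve/{jobId}"
    resource = event.get("resource", "")
    action = resource.strip("/").split("/")[0]
    if action not in {"approve", "reject", "status"}:
        raise ValueError(f"Unsupported route: {resource!r}")

    # Path parameters can be None when the route has none
    job_id = (event.get("pathParameters") or {}).get("jobId")
    if not job_id:
        raise ValueError("Missing jobId path parameter")

    # The request body arrives as a JSON string, or None when empty
    payload = json.loads(event.get("body") or "{}")
    return action, job_id, payload


# Sample event with made-up values, shaped like a proxy integration event
example_event = {
    "resource": "/approve/{jobId}",
    "httpMethod": "POST",
    "pathParameters": {"jobId": "etl-durable-20260101120000-orders.csv"},
    "body": json.dumps({"reviewer": "harpreet", "reason": "Data looks good"}),
}

action, job_id, payload = parse_approval_request(example_event)
print(action, job_id, payload["reviewer"])
```

Keeping the parsing in a small pure function like this means the routing can be unit tested without DynamoDB or any AWS client.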
Deploy it:

cd shared-resources
sam build
sam deploy --guided

Step 2: The Durable Functions SAM Template

Now for the Durable Functions ETL pipeline itself. The key addition is the DurableConfig property, which tells Lambda to enable durable execution for your function.

# durable-functions/template.yaml
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: Lambda Durable Functions ETL Pipeline

# Excerpt: the RawBucketName and ProcessedBucketName parameters and the
# RawDataBucket and ETLMetadataTable resources referenced below are not shown.

Globals:
  Function:
    Runtime: python3.14
    Architectures:
      - arm64
    MemorySize: 512
    Timeout: 900

Resources:
  ETLOrchestratorFunction:
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: ETLDurableOrchestrator
      CodeUri: ./src
      Handler: handlers/etl_handler.lambda_handler
      MemorySize: 1024
      Timeout: 900
      DurableConfig:
        ExecutionTimeout: 86400      # 24 hours for human approval
        RetentionPeriodInDays: 14    # Keep execution history for debugging
      AutoPublishAlias: live
      Policies:
        - AWSLambdaBasicExecutionRole
        - S3CrudPolicy:
            BucketName: !Sub "${RawBucketName}-${AWS::AccountId}"
        - S3CrudPolicy:
            BucketName: !Sub "${ProcessedBucketName}-${AWS::AccountId}"
        - DynamoDBCrudPolicy:
            TableName: !Ref ETLMetadataTable
        - DynamoDBCrudPolicy:
            TableName: etl-pending-approvals
        - SNSPublishMessagePolicy:
            TopicName: etl-approval-notifications
      Events:
        S3Upload:
          Type: S3
          Properties:
            Bucket: !Ref RawDataBucket
            Events: s3:ObjectCreated:*
            Filter:
              S3Key:
                Rules:
                  - Name: prefix
                    Value: uploads/
                  - Name: suffix
                    Value: .csv
      Environment:
        Variables:
          PROCESSED_BUCKET: !Sub "${ProcessedBucketName}-${AWS::AccountId}"
          METADATA_TABLE: !Ref ETLMetadataTable
          APPROVALS_TABLE: etl-pending-approvals
          APPROVAL_TOPIC_ARN: !ImportValue ETL-ApprovalTopicArn
          APPROVAL_API_URL: !ImportValue ETL-ApprovalApiUrl

A few things to notice here:

• MemorySize: 1024 on the orchestrator overrides the 512 MB global default. Since this single function does all the work, it needs more memory.
• ExecutionTimeout: 86400 is the total workflow duration across all invocations (24 hours).
The standard Timeout: 900 is the per-invocation limit (15 minutes). Each checkpoint/resume is a fresh invocation.
• AutoPublishAlias: live. AWS recommends using Lambda versions with durable functions. If you update code while an execution is suspended, replay will use the version that started the execution.
• S3 filter (prefix: uploads/, suffix: .csv): only CSV files under the uploads/ directory trigger the workflow.
• Shared resources: the stack reads the SNS topic ARN and the approval API URL from the shared stack via !ImportValue, and refers to the approvals table by its fixed name.

Step 3: Writing the Durable Function

This is where it gets interesting. The entire ETL pipeline, including the approval wait, lives in a single Lambda function. No state machine definition. No JSON DSL. Just Python code. First, the individual ETL steps. Each one is a regular Python function in a separate file:

Extract

import csv
import io
import boto3
import logging

logger = logging.getLogger()
s3_client = boto3.client("s3")

def extract_data(source_bucket, source_key, step_context=None):
    logger.info(f"Extracting from s3://{source_bucket}/{source_key}")
    response = s3_client.get_object(Bucket=source_bucket, Key=source_key)
    content = response["Body"].read().decode("utf-8")
    reader = csv.DictReader(io.StringIO(content))
    records = list(reader)
    schema = {
        "columns": reader.fieldnames,
        "source_file": source_key,
        "file_size_bytes": response["ContentLength"]
    }
    logger.info(f"Extracted {len(records)} records with {len(schema['columns'])} columns")
    return {"data": records, "record_count": len(records), "schema": schema}

Transform Python import logging from datetime import datetime logger = logging.getLogger() def transform_data(raw_data, schema_config, step_context=None): logger.info(f"Transforming {len(raw_data)} records") valid_records, rejected_records = [], [] for i, record in enumerate(raw_data): try: cleaned = {k: v.strip() if isinstance(v, str) else v for k, v in record.items()} if not cleaned.get("id") or not cleaned.get("name"): 
rejected_records.append({"index": i, "reason": "Missing required field"}) continue if "date" in cleaned: cleaned["date"] = normalize_date(cleaned["date"]) cleaned["_processed_at"] = datetime.utcnow().isoformat() for key in ["amount", "quantity", "price"]: if key in cleaned and cleaned[key]: try: cleaned[key] = float(cleaned[key]) except ValueError: cleaned[key] = None valid_records.append(cleaned) except Exception as e: rejected_records.append({"index": i, "reason": str(e)}) return { "data": valid_records, "valid_records": len(valid_records), "rejected_records": len(rejected_records), "rejection_details": rejected_records[:100] } def normalize_date(date_str): for fmt in ["%Y-%m-%d", "%m/%d/%Y", "%d-%m-%Y", "%Y/%m/%d"]: try: return datetime.strptime(date_str, fmt).strftime("%Y-%m-%d") except ValueError: continue return date_str Load Python import json import boto3 import logging logger = logging.getLogger() s3_client = boto3.client("s3") def load_data(transformed_data, target_bucket, target_key, step_context=None): logger.info(f"Loading {len(transformed_data)} records to s3://{target_bucket}/{target_key}") output_lines = "\n".join(json.dumps(r) for r in transformed_data) s3_client.put_object( Bucket=target_bucket, Key=target_key, Body=output_lines.encode("utf-8"), ContentType="application/jsonl", Metadata={"record_count": str(len(transformed_data))} ) summary = { "record_count": len(transformed_data), "columns": list(transformed_data[0].keys()) if transformed_data else [], "sample_records": transformed_data[:3] } return {"target_path": f"s3://{target_bucket}/{target_key}", "record_count": len(transformed_data), "summary": summary} Notice the steps are plain Python functions — no special decorator, no SDK import. They take step_context=None as an optional last parameter, which keeps them testable outside the durable execution context. 
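Because the steps are ordinary functions, they can be exercised with plain pytest-style tests and no AWS resources. As a small illustration, here is normalize_date copied from the transform step so the snippet runs on its own, with two tests around it. The test names and sample dates are made up for this example.

```python
from datetime import datetime


def normalize_date(date_str):
    # Copied from the transform step above so this snippet is self-contained
    for fmt in ["%Y-%m-%d", "%m/%d/%Y", "%d-%m-%Y", "%Y/%m/%d"]:
        try:
            return datetime.strptime(date_str, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return date_str


def test_normalize_date_accepts_known_formats():
    assert normalize_date("2025-03-07") == "2025-03-07"
    assert normalize_date("03/07/2025") == "2025-03-07"  # month/day/year
    assert normalize_date("07-03-2025") == "2025-03-07"  # day-month-year
    assert normalize_date("2025/03/07") == "2025-03-07"


def test_normalize_date_passes_through_unknown_input():
    # Values that match none of the formats are returned unchanged
    assert normalize_date("March 7th") == "March 7th"


if __name__ == "__main__":
    # Also runnable without pytest
    test_normalize_date_accepts_known_formats()
    test_normalize_date_passes_through_unknown_input()
    print("ok")
```

One thing the tests make visible: the order of the format list decides how ambiguous input is read. A slash-separated date such as 03/07/2025 is treated as month/day/year, while a dash-separated 07-03-2025 is treated as day-month-year.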
Now the main ETL orchestrator that ties it all together: Python import json import os import logging from datetime import datetime from aws_durable_execution_sdk_python import durable_execution, DurableContext from steps.extract import extract_data from steps.transform import transform_data from steps.load import load_data from steps.finalize import finalize_job logger = logging.getLogger() logger.setLevel(logging.INFO) PROCESSED_BUCKET = os.environ.get("PROCESSED_BUCKET") METADATA_TABLE = os.environ.get("METADATA_TABLE") @durable_execution def lambda_handler(event, context: DurableContext): # Handle both S3 event format and direct invocation if "Records" in event: s3_event = event["Records"][0]["s3"] source_bucket = s3_event["bucket"]["name"] source_key = s3_event["object"]["key"] else: source_bucket = event.get("bucket") source_key = event.get("key") # Generate job_id deterministically using context.step() job_id = context.step( lambda _: f"etl-durable-{datetime.utcnow().strftime('%Y%m%d%H%M%S')}-" f"{source_key.split('/')[-1]}", name="generate-job-id" ) context.logger.info(f"Starting ETL job: {job_id}") # Step 1: Extract extracted = context.step( lambda _: extract_data(source_bucket, source_key, None), name="extract-data" ) context.logger.info(f"Extracted {extracted['record_count']} records") # Step 2: Transform transformed = context.step( lambda _: transform_data(extracted["data"], extracted.get("schema", {}), None), name="transform-data" ) # Step 3: Load loaded = context.step( lambda _: load_data(transformed["data"], PROCESSED_BUCKET, f"processed/{job_id}/output.jsonl", None), name="load-data" ) # --- EXECUTION PAUSES HERE --- # The submitter function stores the callback_id in DynamoDB # and sends an SNS notification to the reviewer. # No compute charges while waiting for approval. 
def submit_for_approval(callback_id: str, ctx): return notify_reviewer(job_id, callback_id, loaded["summary"]) approval = context.wait_for_callback( submitter=submit_for_approval, name="quality-check-approval" ) # Parse approval result if isinstance(approval, str): approval = json.loads(approval) # approval may be empty here, so guard the lookup of the reason if not approval or not approval.get("approved"): return {"status": "REJECTED", "job_id": job_id, "reason": (approval or {}).get("reason", "No reason")} # Step 4: Finalize (only runs after approval) final = context.step( lambda _: finalize_job(job_id, source_bucket, source_key, loaded, approval, METADATA_TABLE, None), name="finalize-job" ) return { "status": "COMPLETED", "job_id": job_id, "records_processed": transformed["valid_records"], "output_path": loaded["target_path"], "approved_by": approval.get("reviewer"), "completed_at": final["completed_at"] }

Let me break down the important parts:

• @durable_execution: This decorator (imported from aws_durable_execution_sdk_python) enables the checkpoint/replay mechanism on the handler.
• context.step(lambda _: ..., name="..."): Each step call creates a checkpoint. On replay, completed steps return their cached results instantly instead of re-executing.
• context.wait_for_callback(submitter=..., name="..."): This is the zero-cost waiting magic. The submitter function receives a callback_id, which gets stored in DynamoDB. Execution then pauses completely: Lambda saves the state, shuts down, and you stop paying.
• Determinism matters: Notice job_id is generated inside a context.step(). That is intentional. Since Lambda replays your function from the beginning on resume, datetime.utcnow() would produce a different value on each replay. Wrapping it in a step ensures the timestamp gets checkpointed and replayed consistently.
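To make the determinism point concrete, here is a toy model of step checkpointing. It is not the real SDK: a plain dict stands in for the durable checkpoint store, and run_handler is a hypothetical function that represents one play (or replay) of a handler that needs a unique job ID.

```python
import uuid


def run_handler(checkpoints, use_step):
    """One play or replay of a handler that generates a job id.

    `checkpoints` is a plain dict standing in for the durable
    checkpoint store (an assumption made for this illustration).
    """

    def step(name, fn):
        # A completed step returns its stored result instead of running again
        if name not in checkpoints:
            checkpoints[name] = fn()
        return checkpoints[name]

    if use_step:
        return step("generate-job-id", lambda: f"job-{uuid.uuid4().hex}")
    # Outside a step the value is recomputed on every replay
    return f"job-{uuid.uuid4().hex}"


# Inside a step: the replay sees the checkpointed id
store = {}
first = run_handler(store, use_step=True)
replay = run_handler(store, use_step=True)
print("inside step, ids match:", first == replay)

# Outside a step: each replay invents a new random id
store = {}
first = run_handler(store, use_step=False)
replay = run_handler(store, use_step=False)
print("outside step, ids match:", first == replay)
```

The same reasoning applies to the timestamp in job_id: anything that can differ between two runs of the handler has to be captured by a step so that later replays read the recorded value.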
The notify_reviewer function (in the same file) stores the callback details in DynamoDB and sends an SNS notification: Python def notify_reviewer(job_id, callback_id, summary): import boto3 from datetime import timedelta dynamodb = boto3.resource('dynamodb') sns_client = boto3.client('sns') approvals_table = os.environ.get('APPROVALS_TABLE', 'etl-pending-approvals') approval_topic_arn = os.environ.get('APPROVAL_TOPIC_ARN') approval_api_url = os.environ.get('APPROVAL_API_URL') table = dynamodb.Table(approvals_table) ttl = int((datetime.utcnow() + timedelta(hours=24)).timestamp()) table.put_item(Item={ 'jobId': job_id, 'callbackId': callback_id, 'functionArn': os.environ.get('AWS_LAMBDA_FUNCTION_NAME'), 'workflowType': 'durable-functions', 'summary': json.dumps(summary), 'status': 'pending', 'requestedAt': datetime.utcnow().isoformat(), 'ttl': ttl }) if approval_topic_arn: sns_client.publish( TopicArn=approval_topic_arn, Subject=f'ETL Job Approval Required: {job_id}', Message=f"Job ID: {job_id}\n" f"Approve: POST {approval_api_url}/approve/{job_id}\n" f"Reject: POST {approval_api_url}/reject/{job_id}" ) return {"job_id": job_id, "callback_id": callback_id, "status": "pending"} The workflowType: 'durable-functions' field is important — it tells the shared approval handler which callback mechanism to use when the reviewer responds. 
Step 4: The Shared Approval Handler When the reviewer clicks approve, the shared handler looks up the callbackId from DynamoDB and sends the callback to the paused durable execution: Python # shared-resources/src/approval_handler.py (key excerpt) if workflow_type == 'durable-functions': callback_id = approval_record.get('callbackId') if approved: lambda_client.send_durable_execution_callback_success( CallbackId=callback_id, Result=json.dumps(approval_response) ) else: lambda_client.send_durable_execution_callback_failure( CallbackId=callback_id, Error='JobRejected', Cause=reason or 'Job rejected by reviewer' ) elif workflow_type == 'step-functions': task_token = approval_record.get('taskToken') if approved: stepfunctions.send_task_success( taskToken=task_token, output=json.dumps(approval_response) ) Same API, same reviewer experience — the underlying callback mechanism is the only thing that differs. Step 5: Deploy and Test Deploy in order (shared resources first, since the other stacks import from it): Markdown # 1. Deploy shared resources cd shared-resources sam build && sam deploy --guided # 2. 
Deploy Durable Functions cd ../durable-functions sam build && sam deploy --guided Generate test data: Markdown python scripts/generate_test_data.py --count 10 --output test-data/ Upload files to trigger the workflow (note the uploads/ prefix — the S3 filter requires it): Markdown aws s3 cp test-data/ s3://etl-raw-data-bucket-YOUR_ACCOUNT_ID/uploads/ --recursive Check approval status and approve: Markdown # Check status curl https://<api-id>.execute-api.us-east-1.amazonaws.com/prod/status/<job-id> # Approve curl -X POST https://<api-id>.execute-api.us-east-1.amazonaws.com/prod/approve/<job-id> \ -H "Content-Type: application/json" \ -d '{"reviewer": "harpreet", "reason": "Data looks good"}' For bulk approvals during testing, the repo includes a handy script: Markdown ./scripts/approve_all_jobs.sh For local testing, the testing SDK supports pytest: Markdown pip install aws-lambda-durable-execution-sdk-testing pytest durable-functions/tests/ Step 6 (Optional): Deploy Step Functions for Comparison If you want to reproduce my full comparison, deploy the Step Functions stack too: Markdown cd step-functions sam build && sam deploy --guided Here is what the same workflow looks like in Amazon States Language: JSON { "StartAt": "ExtractData", "States": { "ExtractData": { "Type": "Task", "Resource": "${ExtractFunctionArn}", "ResultPath": "$.extractResult", "Next": "TransformData" }, "TransformData": { "Type": "Task", "Resource": "${TransformFunctionArn}", "ResultPath": "$.transformResult", "Next": "LoadData" }, "LoadData": { "Type": "Task", "Resource": "${LoadFunctionArn}", "ResultPath": "$.loadResult", "Next": "WaitForApproval" }, "WaitForApproval": { "Type": "Task", "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken", "Parameters": { "FunctionName": "${ApprovalFunctionArn}", "Payload": { "taskToken.$": "$$.Task.Token", "jobId.$": "$.loadResult.job_id", "summary.$": "$.loadResult.summary" } }, "TimeoutSeconds": 86400, "ResultPath": "$.approvalResult", "Next": 
"CheckApproval" }, "CheckApproval": { "Type": "Choice", "Choices": [{ "Variable": "$.approvalResult.approved", "BooleanEquals": true, "Next": "FinalizeJob" }], "Default": "JobRejected" }, "JobRejected": { "Type": "Pass", "Result": { "status": "REJECTED" }, "End": true }, "FinalizeJob": { "Type": "Task", "Resource": "${FinalizeFunctionArn}", "End": true } } } Compare the two approaches. Durable Functions: one Python file, one Lambda, familiar programming constructs. Step Functions: a JSON state machine definition, five separate Lambda functions, plus the ASL learning curve. Both do the same thing. The Real Cost Numbers Now, here is the part that made me rebuild a mental model I had about serverless orchestration costs. I ran 1,000 CSV files through this exact workflow — both with Durable Functions and with the Step Functions implementation. The approval wait averaged about 20 minutes per document. Cost ComponentDurable FunctionsStep FunctionsDifferenceLambda invocations$0.000358$0.001-64%Lambda duration$0.0308$0.0179+72%State transitions$0.000$0.175-100%DynamoDB$0.003$0.0030%S3 operations$0.010$0.0100%TOTAL$0.044$0.207-79% Source: AWS CloudWatch Metrics The total cost, which is 79% cheaper, is mainly driven almost entirely by one thing: state transitions. Step Functions charges $0.025 per 1,000 state transitions. ASL workflow has 7 states (ExtractData, TransformData, LoadData, WaitForApproval, CheckApproval, JobRejected/FinalizeJob). For 1,000 workflows, that is 7,000 transitions, which costs $0.175. That single line (state transition) item is 84% of the total Step Functions cost. Durable Functions eliminates state transition costs. The trade-off? Higher Lambda duration costs ($0.031 vs. $0.018) because the durable function runs with 1,024 MB memory (single function handling all work) compared to Step Functions using 512 MB per function across five smaller functions. 
At scale, the difference adds up quickly:

| Daily Volume | Durable Functions/year | Step Functions/year | Annual Savings |
| --- | --- | --- | --- |
| 1,000/day | $16.06 | $75.56 | $59.50 |
| 10,000/day | $160.60 | $755.60 | $595 |
| 100,000/day | $1,606 | $7,556 | $5,950 |

And the most important validation: both systems achieved $0 compute cost during the 20-minute approval wait. That is the real game-changer compared to polling or always-on servers.

Understanding the Replay Model

One thing that confused me initially was the invocation count. I expected 1,000 invocations for 1,000 workflows. Instead, I got 1,788. Here is why. The checkpoint/replay model means each workflow requires a minimum of 2 invocations:

• Initial invocation: the S3 trigger fires and the function runs generate-job-id → extract → transform → load → submit-for-approval → pause
• Resume invocation: the callback is received and the function replays from the beginning (all completed steps return cached results instantly), then executes the finalize step

So the theoretical minimum is 2,000 invocations for 1,000 completed workflows. The actual number was 1,788 because some workflows were still pending approval when I collected the metrics over the 24-hour measurement window.

The important thing to remember: your code must be deterministic. Since Lambda replays your function from the beginning on resume, any non-deterministic operations (random numbers, timestamps, external API calls) must happen inside context.step() blocks so their results get checkpointed.

job_id = context.step(
    lambda _: f"etl-durable-{datetime.utcnow().strftime('%Y%m%d%H%M%S')}-"
              f"{source_key.split('/')[-1]}",
    name="generate-job-id"
)

That is exactly why the job_id generation in our code uses context.step(). Without it, the timestamp would change on every replay.
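The two-invocation pattern can be imitated in a few lines of plain Python. This is a toy model written for this explanation, not the real SDK: ToyContext, Suspended, and the two dicts that stand in for durable storage are all assumptions. Its step() and wait_for_callback() only mimic the call shapes used in the handler above.

```python
class Suspended(Exception):
    """Signals that the toy workflow is paused until a callback arrives."""


class ToyContext:
    """Toy stand-in for a durable context. Not the real SDK."""

    def __init__(self, checkpoints, callbacks):
        self.checkpoints = checkpoints  # step results, kept across invocations
        self.callbacks = callbacks      # callback results delivered so far

    def step(self, fn, name):
        # Run the step once; on replay, return the stored result instead
        if name not in self.checkpoints:
            self.checkpoints[name] = fn(None)
        return self.checkpoints[name]

    def wait_for_callback(self, submitter, name):
        if name in self.callbacks:
            return self.callbacks[name]
        # First pass: run the submitter once, then stop this invocation
        self.step(lambda _: submitter(f"cb-{name}", None), name=f"{name}-submit")
        raise Suspended(name)


executed = []  # step bodies that actually ran, across all invocations


def handler(ctx):
    ctx.step(lambda _: executed.append("extract"), name="extract")
    ctx.step(lambda _: executed.append("load"), name="load")
    approval = ctx.wait_for_callback(
        submitter=lambda callback_id, _ctx: executed.append("notify"),
        name="approval",
    )
    ctx.step(lambda _: executed.append("finalize"), name="finalize")
    return approval


checkpoints, callbacks, invocations = {}, {}, 0

# Invocation 1: runs up to the approval wait, then suspends
invocations += 1
try:
    handler(ToyContext(checkpoints, callbacks))
except Suspended:
    pass

# The reviewer approves while nothing is running
callbacks["approval"] = {"approved": True}

# Invocation 2: replays from the top, skips completed steps, finalizes
invocations += 1
result = handler(ToyContext(checkpoints, callbacks))

print(invocations, executed, result)
```

Running it prints `2 ['extract', 'load', 'notify', 'finalize'] {'approved': True}`: two invocations, but each step body ran exactly once, because the second invocation read the first one's checkpoints instead of redoing the work.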
Here are some other common sources of non-determinism and how to handle them. The examples use JavaScript names; the Python equivalents of the first two are random.random() and datetime.now().

| Non-deterministic operation | Why It Breaks | Solution |
| --- | --- | --- |
| Math.random() | Different value on every replay | Wrap in context.step() |
| Date.now() | Time keeps moving forward | Use context.timestamp or wrap in a step |
| Global variables | Might change between replays | Pass state through function arguments |
| External API calls | Network is a lie | Always wrap in context.step() |
| Iterating over Map or Set | Iteration order can vary by runtime | Use arrays or ensure stable ordering |

When Not to Use Durable Functions

I want to be honest about the trade-offs, because this is not a "Durable Functions is better than everything" story.

Choose Step Functions when:

• Visual debugging matters. The Step Functions state machine execution graph is genuinely superior. You can see exactly which step failed, inspect the input/output of each state, and non-technical stakeholders can actually understand what the workflow is doing. Durable Functions does come with visual analysis, monitoring, and debugging as well, but it is geared more toward developers.
• Multi-service orchestration. Step Functions has 220+ native AWS service integrations, including DynamoDB, SQS, SNS, ECS, and Glue, all without writing Lambda glue code. In our Step Functions implementation, the ASL connects directly to Lambda function ARNs with built-in retry policies. With Durable Functions, all integrations go through your Lambda code.
• Express Workflows apply. For short-duration (under 5 minutes), high-throughput workflows, Step Functions Express Workflows use a different pricing model that can be very competitive.

Choose Durable Functions when:

• Cost optimization is the priority (79% savings at scale).
• Workflows are Lambda-centric (your logic lives in Lambda code anyway).
• You prefer writing orchestration in Python or TypeScript rather than Amazon States Language. AWS has also just released Lambda Durable Functions for Java in developer preview.
• Your logic is complex, and your developers prefer a general-purpose programming language over the declarative ASL.

AWS recommends a hybrid approach: use Durable Functions for application-level logic within Lambda, and Step Functions for high-level orchestration across multiple AWS services.

Concurrency Planning: A Quick Note

One thing worth mentioning: Durable Functions consolidates your entire workflow into a single Lambda function (ETLDurableOrchestrator in our case). This means your Lambda concurrency quota directly limits how many workflows can run simultaneously. Step Functions distributes execution across five separate Lambda functions (Extract, Transform, Load, Approval, Finalize), spreading the concurrency demand. In practice, this means you should plan your Lambda concurrency quotas carefully when using Durable Functions. If you expect burst uploads of hundreds or thousands of files at once, set reserved concurrency appropriately for your workload. This applies to both services; the difference is just where the concurrency demand concentrates.

Wrapping Up

Lambda Durable Functions is a genuinely useful addition to the serverless toolkit. For a simple ETL pipeline with human-in-the-loop approval, it delivered 79% cost savings over Step Functions while achieving the same 100% success rate and zero-cost waiting. The code-first approach feels natural if you are already comfortable writing Lambda functions in Python, TypeScript, or Java. The wait_for_callback pattern for human approvals is clean and straightforward. And the cost savings are real: they come almost entirely from eliminating state transition charges. That said, Step Functions remains the better choice when visual workflow representation, multi-service orchestration, or operational simplicity are your priorities. There is no universal winner here; it depends on what your team values more.
The complete implementation — both SAM stacks, shared approval infrastructure, test data generation scripts, bulk approval scripts, and detailed cost analysis — is available here: github.com/hsiddhu2/aws-lambda-durable-vs-stepfunctions. Clone it, deploy both implementations, run your own 1,000-file comparison, and see the numbers for yourself. The ~79% cost advantage held consistent for this workflow, but your number will vary based on workflow complexity and state count.

By Harpreet Siddhu
Setting Up a Data Catalog With Azure Purview and Collibra: What Three Attempts Taught Me

My data catalog project was the third time in my career that I had led a catalog implementation. My first was a custom-built solution in 2015 that worked but required three engineers to maintain. Number two was an off-the-shelf tool that nobody used because it was too cumbersome to keep current. For this third attempt, I wanted to get it right. We implemented Azure Purview for automated discovery and technical metadata, and Collibra for business glossary, data ownership, and governance workflows. They serve different functions and are connected through a custom integration. Here is how we set it up and what surprised us. Why Two Tools? Azure Purview is excellent at automated technical metadata collection. Purview scans your data sources on a schedule, discovers tables and columns, infers data types, and builds an automatically-maintained lineage graph. Automated discovery is its primary value. Doing this manually doesn't scale, and any manually-maintained catalog falls behind the actual state of the data within months. Purview isn't good at business governance workflows: data stewardship, business term assignment, data quality certification, access request approvals. These require human processes with approvals and audit trails that Purview's workflow capabilities do not cover adequately. Collibra handles the governance workflow side. Business data stewards maintain the business glossary in Collibra. Ownership assignments and data quality certifications go through Collibra's workflow engine. When a data consumer wants to know what a dataset means in business terms, they look in Collibra. When they want to know where the data physically lives and what its schema is, they look in Purview. The Purview Setup Purview scans are configured per data source. We set up scans for our three ADLS Gen2 storage accounts, our Azure SQL databases, our Databricks Unity Catalog, and our Azure Data Factory pipelines. 
Scans run daily for production data sources and weekly for development. Purview builds a lineage graph from ADF pipelines, which is genuinely useful. We can see, for any given table, which pipelines write to it and which tables it reads from. Lineage tracking has been valuable three times in incident investigations where we needed to understand the upstream sources of a corrupted dataset. Custom classifications are worth the setup time. Purview comes with built-in classifiers for common PII patterns: email addresses, phone numbers, credit card numbers, and national ID formats for several countries. We added custom classifiers for our internal account number formats and insurance policy number patterns. Automated classification isn't perfect, about 85% accurate in our testing, but it surfaces PII-candidate columns that manual review would miss.

# Purview scan configuration (REST API)
import requests

def create_purview_scan(account_name, collection, data_source):
    url = (f"https://{account_name}.purview.azure.com/scan/datasources/"
           f"{data_source}/scans/daily-production-scan")
    body = {
        "kind": "AzureStorageMsi",
        "properties": {
            "scanRulesetName": "custom-pii-ruleset",
            "scanRulesetType": "Custom",
            "collection": {"referenceName": collection},
            "credential": {
                "referenceName": "managed-identity",
                "credentialType": "ManagedIdentity"
            }
        },
        "trigger": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2024-01-01T02:00:00Z",
                "timezone": "UTC"
            }
        }
    }
    # get_auth_headers() is not shown in this article; it is assumed to
    # return the Authorization headers for the Purview account
    resp = requests.put(url, json=body, headers=get_auth_headers())
    return resp.json()

# Custom classifier for internal account numbers
custom_classifier = {
    "kind": "Custom",
    "properties": {
        "classificationName": "INTERNAL_ACCOUNT_NUMBER",
        # The pattern below matches "ACC" followed by 9 digits (12 characters)
        "description": "Internal 12-character account number format (ACC + 9 digits)",
        "classificationRule": {
            "kind": "Regex",
            "pattern": "^ACC[0-9]{9}$",
            "minimumPercentageMatch": 75
        }
    }
}

The Collibra Integration

We built a nightly sync that reads technical metadata from Purview via its REST API and 
creates or updates corresponding assets in Collibra. Our sync maps Purview datasets to Collibra data assets, adds technical metadata (schema, classification, lineage summary) as attributes on the Collibra asset, and creates a link between the Collibra and Purview assets so users can navigate between the business and technical views. Building this sync took about six weeks of engineering time. It's the part of the implementation I considered most for an off-the-shelf connector, but the available connectors didn't handle our specific Purview classification tagging approach correctly. Our custom sync has been running for 14 months with minimal maintenance. Python # Nightly Purview-to-Collibra metadata sync (Python) import requests from datetime import datetime def sync_purview_to_collibra(purview_client, collibra_client): """Sync technical metadata from Purview to Collibra assets.""" # Fetch all cataloged assets from Purview purview_assets = purview_client.discovery.query( keywords="*", filter={"and": [ {"entityType": "azure_datalake_gen2_path"}, {"classification": ["confidential", "restricted"]} ]}, limit=1000 ) for asset in purview_assets['value']: collibra_asset = collibra_client.find_or_create_asset( name=asset['name'], domain="Data Lake Assets", type="Data Set" ) # Sync technical metadata as attributes collibra_client.update_attributes(collibra_asset['id'], { "Technical Schema": asset.get('schema', ''), "Data Classification": asset.get('classification', []), "Purview Asset Link": asset['id'], "Last Scanned": asset.get('lastScanTime', ''), "Lineage Summary": get_lineage_summary( purview_client, asset['id']), "Sync Timestamp": datetime.utcnow().isoformat() }) return {"synced": len(purview_assets['value']), "timestamp": datetime.utcnow().isoformat()} What Adoption Looked Like Adoption was slow. We launched the catalog with a communication campaign, internal documentation, and a live demo. 
After three months, we'd had about 30% of our target user base actively using it, primarily data engineers who were looking up lineage information. Analysts and business stakeholders, the people Collibra was primarily designed to support, were largely not using it. Adoption really broke through when we integrated the catalog with our data access request process. Previously, access requests went to a Jira form. We changed the process: to request access to a dataset, you start from the Collibra data asset page. Each access request is contextualized with the asset's classification, ownership, and purpose, which both the requester and the approver see during the approval workflow. Usage of Collibra for data assets grew by 300% in the month after we made this change. Python # Collibra asset mapping schema for access request workflow { "asset_type": "Data Set", "domain": "Data Lake Assets", "attributes": { "Technical Name": {"type": "text", "source": "purview"}, "Business Name": {"type": "text", "source": "steward"}, "Data Classification": { "type": "single_select", "values": ["public", "internal", "confidential", "restricted"], "source": "purview" }, "Owner Team": {"type": "text", "source": "steward"}, "PII Columns": {"type": "multi_select", "source": "purview"}, "Quality Certification": { "type": "single_select", "values": ["certified", "provisional", "uncertified"], "source": "steward" }, "Access Request URL": { "type": "url", "template": "https://collibra.internal/access/{asset_id}" } }, "workflow": { "access_request": { "approvers": ["asset_owner", "data_governance_lead"], "sla_hours": 48, "auto_revoke_days": 365 } } } The Honest Caveat A data catalog requires ongoing investment that is easy to underestimate. Automated parts, Purview's scanning and discovery, take care of themselves. Business governance parts, glossary maintenance, stewardship assignments, and quality certifications require human effort that must be budgeted and owned. 
Our Collibra business glossary currently covers about 60% of our production datasets. The remaining 40% have technical metadata from Purview but no business context. That 40% is smaller than it was six months ago, which means we are making progress. But it's a real gap that we manage explicitly rather than pretending the catalog is complete.

By Kuladeep Sandra
When Perfect Data Breaks: The Journey from Data Quality to Data Observability

The Day Everything Looked Fine, Until It Wasn’t

The dashboards were green. Every test passed. And yet, by morning, the company’s revenue had mysteriously dropped by roughly $1 million. The data team huddled together, blinking at their screens.

• Schema checks? They looked good.
• Nulls? Checks passed, and everything appeared to be in order.
• Completeness? It looked good.

Nothing looked wrong, except that something was causing the business to bleed. What they didn’t know yet was that an innocent iOS app update had quietly scrambled the order of user events. To the system, customers were suddenly purchasing before browsing. The models didn’t break in code; they broke in meaning. The team discovered a crucial lesson: even flawless data systems can mislead without true observability.

Why “Good Data” Isn’t Good Enough Anymore

There was a time when data quality was the gold standard and a measure of success. Passing DQ checks meant your dataset was protected: if it was clean, complete, and validated, your insights were gold. But that was back when pipelines were simple, ETL jobs ran once a night, and life was predictable.

Back then, most data was read by people, not systems. Analysts looked at dashboards after the fact, asked questions when numbers felt off, and applied judgment before anyone made a real decision. If a table landed late or a metric looked strange, someone usually noticed, often before it caused real damage. Data quality checks were designed for this world: static, batch-oriented, and tolerant of human interpretation.

But as technology changed, so did expectations. Today’s world is different. This shift matters most for data engineers, analytics engineers, and platform teams responsible for the reliability of downstream dashboards, APIs, and machine learning systems. Modern cloud-native companies run thousands of interdependent batch and streaming pipelines, constantly feeding dashboards, APIs, and machine learning systems.
A single column rename, a delayed partition, or an unnoticed schema tweak can quietly throw everything off course.

Traditional data quality is like checking your car’s oil once a month. Data observability is like installing a dashboard gauge that alerts you in real time when the engine is overheating.

The Shift: From Data Quality to Data Observability

Data quality answers the question: “Is this dataset correct right now?” Data observability asks something deeper: “Is my data behaving as it should?”

Aspect | Data Quality | Data Observability
Focus | Data-at-rest | Data-in-motion
Checks | Accuracy, completeness, validity | Freshness, volume, distribution, schema, lineage
When | Point-in-time | Continuous
Goal | Ensure correctness | Ensure reliability
View | Local | End-to-end

The Five Pillars of Data Observability

• Freshness: Is data arriving on time relative to SLAs?
• Volume: Are record counts within expected ranges?
• Distribution: Have key statistics (e.g., averages, percentiles) drifted unexpectedly?
• Schema: Did upstream fields change without notice?
• Lineage: What depends on what, and who owns it?

Together, these pillars act as an early-warning system for your data ecosystem, sensing changes before they cause downstream impact.

The Story Behind the $1M Drop

Our e-commerce company’s recommendation engine accounted for 40% of revenue. After a routine app update, click-throughs fell by 15%, conversions by 22%, and revenue tumbled. And yet, all quality checks still passed.

Check | Status | Missed Insight
Schema | ✅ | Timestamps changed meaning
Nulls | ✅ | Events arrived out of sequence
Ranges | ✅ | Valid values, wrong order

Data quality confirmed the structure. It missed the story.

Event order sounds like a minor detail, but for recommendation models, it’s foundational. Browsing before purchasing means something very different than purchasing before browsing. When that sequence flipped, nothing crashed; the model simply learned the wrong story about customers.
Since the data remained complete, valid, and schema-compliant, every traditional check passed, even as the model’s understanding of user behavior quietly unraveled.

The Hidden Issue

The iOS app began batching events. They arrived six hours late and out of order.

Before (Healthy) | After (Broken)
View → Add to Cart → Purchase | Purchase → View → Add to Cart

The model interpreted chaos as logic, and that’s when recommendations became noise.

How Observability Would Have Saved the Day

Within two hours, an observability system would have screamed:

• Freshness Alert: Event lag jumped from 5 mins to 360 mins
• Distribution Alert: 78% of events out of sequence
• Lineage Alert: iOS v1.3.0 deployed, impacting 47 tables and degrading 12 ML models

Approach | Detection | Root Cause | Resolution Time
Data Quality | Missed | Undetected | 3 days
Data Observability | Caught early | iOS v1.3.0 deployment | 6 hours

Observability wouldn’t just have found the broken data; it would have connected the dots to the moment things went wrong. The real win isn’t just catching the issue faster. It is knowing exactly what changed, when it changed, and how far the damage spread. That makes it possible to roll back quickly and explain what happened without guesswork. Without observability, teams debate symptoms. With it, they start acting on causes.

Building Observability Step by Step

So how does a modern data team move from reactive firefighting to proactive confidence?

1. Define Data Contracts

Every dataset should have a clear, versioned schema (YAML, Avro, Protobuf). Contracts live in code and are automatically validated before each pipeline run and before new data is added to the dataset. Data contracts are often the first thing teams skip. They feel slow, bureaucratic, and unnecessary, right up until a breaking change slips through and every downstream table starts lying.

2. Add Freshness & Volume Monitors

Track how long data takes to arrive and whether counts fall outside norms. Each row’s updated-at timestamp should fall within the defined SLO.
Define SLOs such as “99% of partitions land within 10 minutes.” Without explicit SLAs, delays are only discovered after dashboards update or don’t. By then, decisions have already been made on stale data.

3. Strengthen Tests

Layer dbt’s `not_null` and `unique` tests with drift tests, e.g., “average session_length stays within 10% of baseline,” or “count of new orders placed stays within 10% of the baseline.” Basic checks are good at catching broken tables, but they don’t tell you when data starts behaving differently. Drift tests exist for the uncomfortable cases where everything looks valid but isn’t.

4. Emit Lineage

Integrate OpenLineage with Airflow or dbt to visualize dependencies and trace impact instantly. Without lineage, every alert triggers a manual investigation. With it, teams can immediately see blast radius and ownership.

5. Centralize Visibility

Bring all signals into one pane of glass. When freshness lives in one tool, lineage in another, and alerts in Slack, every incident turns into a scavenger hunt. Pulling those signals together is what turns alerts into answers. Now, when an alert fires, you know what broke, where, and who’s responsible.

A Familiar Pattern

If this story sounds familiar, it’s because it’s happening everywhere.

• Teams at Netflix have described recommendation quality degrading after upstream data schemas changed without downstream safeguards.
• Uber has publicly discussed timezone-related bugs that impacted time-based systems, including pricing and incentives.
• Airbnb has shared incidents where aggressive deduplication and data-cleaning logic removed valid records.
• Stripe has written extensively about how tiny currency-rounding errors can quietly compound into material financial discrepancies at scale.

Different problems, same root cause: great data quality, no visibility.

Let’s Distill the Lesson: Quality Validates. Observability Protects.

Data quality ensures your data is correct.
Data observability ensures your system stays trustworthy. In today’s interconnected world, where every pipeline is a domino, observability isn’t a luxury; it’s a seatbelt. So the next time your dashboard shows that comforting little green badge labeled “Fresh & Verified,” remember: behind that glow lies a safety net of observability quietly keeping your business upright.
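To make the monitors from steps 2 and 3 more concrete, here is a minimal sketch of three checks: a freshness SLO ("99% of partitions land within 10 minutes"), a 10% drift test against a baseline, and an out-of-sequence measure like the one in the distribution alert. Every function name, the funnel-stage mapping, and the record layout are assumptions made for illustration; they are not taken from any particular observability tool.

```python
from datetime import datetime

# Hypothetical funnel ordering used to detect "purchase before view" arrivals.
FUNNEL_STAGE = {"view": 0, "add_to_cart": 1, "purchase": 2}


def lag_minutes(event_time: datetime, arrival_time: datetime) -> float:
    """Minutes between when an event happened and when it landed."""
    return (arrival_time - event_time).total_seconds() / 60.0


def breaches_freshness_slo(lags, slo_minutes=10.0, target_fraction=0.99):
    """True when fewer than target_fraction of arrivals land within the SLO.

    An empty input counts as a breach, because nothing arrived at all.
    """
    if not lags:
        return True
    on_time = sum(1 for lag in lags if lag <= slo_minutes)
    return on_time / len(lags) < target_fraction


def drifted(current, baseline, tolerance=0.10):
    """True when current deviates from baseline by more than tolerance."""
    if baseline == 0:
        return current != 0
    return abs(current - baseline) / abs(baseline) > tolerance


def out_of_sequence_fraction(events):
    """Fraction of events that arrive after a later funnel stage was already
    seen for the same user. `events` is a list of (user_id, event_type)
    tuples in arrival order."""
    if not events:
        return 0.0
    furthest = {}
    out_of_order = 0
    for user_id, event_type in events:
        stage = FUNNEL_STAGE[event_type]
        seen = furthest.get(user_id, -1)
        if stage < seen:
            out_of_order += 1
        furthest[user_id] = max(stage, seen)
    return out_of_order / len(events)


healthy = [("u1", "view"), ("u1", "add_to_cart"), ("u1", "purchase")]
broken = [("u2", "purchase"), ("u2", "view"), ("u2", "add_to_cart")]
print(out_of_sequence_fraction(healthy + broken))
```

The sequence check is deliberately naive: a customer who legitimately browses again after buying would also be counted, so a real monitor would likely compare the fraction against its own historical baseline rather than against zero.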

By Divyakumar Savla
One Query, Four GPUs: Tracing a Distributed Training Stall Across Nodes

TL;DR

A single straggling node held up a 4-node distributed training job. We found it by fanning out one SQL query to all four nodes and getting the answer in under a second. This is distributed GPU training debugging with eBPF: no central service, no Prometheus, no time-series database, just the same single-binary agent already running on each machine.

The Problem We Kept Hitting

We’ve been building Ingero, an eBPF agent that traces CUDA API calls and host kernel events to explain GPU latency. Until v0.9, it was single-node only. Trace one machine, explain what happened on that machine. For single-GPU inference or training, that worked well.

But distributed training spreads the debugging surface across machines. When a 4-node DDP job slows down, the question is always: which node? And then: why? nvidia-smi on each machine reports healthy utilization. dstat shows nothing obvious. The typical workflow is SSH-ing into each box, eyeballing logs, diffing timestamps across terminals, and hoping the issue is still happening.

We wanted a cross-node investigation without adding infrastructure. The question was: what’s the simplest architecture that works?

What We Shipped in v0.9.1

Three features, all built on top of the existing per-node agent. No new services, no new daemons, no new ports.

1. Node Identity

Every event now carries a node tag. The agent stamps each event with a name from a --node flag, an ingero.yaml config value, or the hostname as fallback:

Shell

sudo ingero trace --node gpu-node-01

Event IDs become node-namespaced (gpu-node-01:4821) so databases from different nodes can merge without collisions. For torchrun workloads, rank and world size are auto-detected from environment variables (RANK, LOCAL_RANK, WORLD_SIZE), so no extra configuration is needed.

2. Fleet Fan-Out Queries

Each Ingero agent already exposes a dashboard API over HTTPS (TLS 1.3, auto-generated ECDSA P-256 cert if no custom cert is provided).
The new fleet client sends the same query to every node in parallel, collects the results, and concatenates them with a node column prepended.

For production clusters, the client supports mTLS (--ca-cert, --client-cert, --client-key) so both sides authenticate. Plain HTTP is available via --no-tls but requires an explicit opt-in, and even then, it’s intended for trusted VPC networks only.

The --nodes flag works for ad-hoc queries, but for anything beyond a handful of nodes, the node list goes into ingero.yaml once and every command picks it up automatically:

YAML

fleet:
  nodes:
    - gpu-node-01:8080
    - gpu-node-02:8080
    - gpu-node-03:8080
    - gpu-node-04:8080

A full example config is in configs/ingero.yaml.

Here’s what it looked like when we ran it against a 4-node cluster where one node was misbehaving:

Shell

$ ingero query --nodes gpu-node-01:8080,gpu-node-02:8080,gpu-node-03:8080,gpu-node-04:8080 \
    "SELECT node, source, count(*) as cnt, avg(duration)/1000 as avg_us FROM events GROUP BY node, source"

node             source cnt   avg_us
---------------- ------ ----- ------
gpu-node-01      4      11009 5.2
gpu-node-01      3      847   18400   # ← 9x higher than peers
gpu-node-02      4      10892 5.1
gpu-node-02      3      412   2100
gpu-node-03      4      10847 5.3
gpu-node-03      3      398   1900
gpu-node-04      4      10901 5.0
gpu-node-04      3      421   2200

8 rows from 4 node(s)

Node 1 jumps out immediately: 847 host events at 18.4ms average, while the other three sit around 2ms.
One more command to see the causal chains:

Shell

$ ingero explain --nodes gpu-node-01:8080,gpu-node-02:8080,gpu-node-03:8080,gpu-node-04:8080

FLEET CAUSAL CHAINS - 2 chain(s) from 4 node(s)

[HIGH] [gpu-node-01] cuLaunchKernel p99=843us (63.9x p50) - 847 sched_switch events + heavy block I/O
  Root cause: 847 sched_switch events + heavy block I/O
  Fix: Pin training process to dedicated cores with taskset; Add nice -n 19 to background jobs

[MEDIUM] [gpu-node-01] cuMemAlloc p99=932us (5.0x p50) - 855 sched_switch events + heavy block I/O
  Root cause: 855 sched_switch events + heavy block I/O
  Fix: Pin training process to dedicated cores with taskset

Both chains are on gpu-node-01. The other three nodes have zero issues. The root cause: CPU contention from block I/O, with checkpoint writes preempting the training process. Two commands to go from “distributed training is slow” to “pin the training process on node 1 and investigate the I/O source.”

3. Offline Merge and Perfetto Export

Not every environment allows live HTTP queries between nodes. Air-gapped clusters, locked-down VPCs, compliance constraints: there are real reasons the network path isn’t always available. For those cases, ingero merge combines SQLite databases from each node into a single queryable file:

Shell

# 1. Collect traces from each node
scp gpu-node-01:~/.ingero/ingero.db node-01.db
scp gpu-node-02:~/.ingero/ingero.db node-02.db

# 2. Merge and analyze
ingero merge node-01.db node-02.db -o cluster.db
ingero explain -d cluster.db

Stack traces are deduplicated by hash. Events keep their node-namespaced IDs. Old databases that predate the node column work with --force-node.

For visual timeline analysis, ingero export --format perfetto produces a Chrome Trace Event Format JSON that opens in ui.perfetto.dev. Each node gets its own process track. Causal chains show up as severity-colored markers. The straggler is visible at a glance in the timeline.
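Because the merged file is plain SQLite, the same per-node comparison can also be scripted. The sketch below reuses the GROUP BY query from the fan-out example and flags any node whose average is well above the median of its peers. The three-column `events` table, the nanosecond duration unit, the 3x ratio, and the demo rows are assumptions for illustration only; the real Ingero schema has more columns and may differ.

```python
import sqlite3
from statistics import median

# Same aggregation as the fleet query shown earlier in the article.
QUERY = """
SELECT node, source, count(*) AS cnt, avg(duration) / 1000 AS avg_us
FROM events
GROUP BY node, source
"""


def find_stragglers(conn, ratio=3.0):
    """Return (node, source, avg_us, peer_median_us) for every node whose
    average exceeds `ratio` times the median of the other nodes."""
    by_source = {}
    for node, source, _cnt, avg_us in conn.execute(QUERY):
        by_source.setdefault(source, {})[node] = avg_us

    flagged = []
    for source, per_node in by_source.items():
        for node, avg_us in per_node.items():
            peers = [v for n, v in per_node.items() if n != node]
            if not peers:
                continue  # a single node has nothing to be compared against
            peer_median = median(peers)
            if peer_median > 0 and avg_us > ratio * peer_median:
                flagged.append((node, source, avg_us, peer_median))
    return sorted(flagged)


def demo_db():
    """Build an in-memory stand-in for cluster.db (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (node TEXT, source INTEGER, duration INTEGER)"
    )
    host_ns = {
        "gpu-node-01": 18_400_000,
        "gpu-node-02": 2_100_000,
        "gpu-node-03": 1_900_000,
        "gpu-node-04": 2_200_000,
    }
    rows = []
    for node, ns in host_ns.items():
        rows += [(node, 3, ns)] * 5      # host-side events
        rows += [(node, 4, 5_000)] * 5   # fast events, similar on all nodes
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    return conn


print(find_stragglers(demo_db()))
```

Against a real merged file, `demo_db()` would be replaced with `sqlite3.connect("cluster.db")`. Comparing against the peer median rather than the mean keeps a single bad node from dragging the reference value toward itself.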
Why We Built It This Way

The obvious approach to multi-node observability is a central collector: ship events to a time-series database, build dashboards, set up alerts. Prometheus, Datadog, Honeycomb: the well-trodden path. We deliberately avoided that.

No new infrastructure. Ingero is a zero-config, single-binary agent with no dependencies. Adding a central collector contradicts that. The fleet client is 400 lines of Go in the existing binary. It reuses the HTTPS API the agent already exposes. Nothing new to deploy, nothing new to secure: the same TLS 1.3 + mTLS configuration that protects a single node’s dashboard protects the entire fleet.

Client-side fan-out is simple and sufficient. The CLI sends concurrent HTTP requests, collects results, and merges them locally. A sync.WaitGroup, some JSON decoding, column concatenation. No distributed query planning, no consensus protocol, no coordinator election. For 4-50 nodes, this is the right level of complexity.

Partial failure is first-class. If one node is unreachable, results from the others still come back, plus a warning. No all-or-nothing semantics. In practice, the unreachable node is often the one in trouble, and knowing which nodes failed is diagnostic information in itself.

Clock skew is measured, not ignored. eBPF timestamps come from bpf_ktime_get_ns() (CLOCK_MONOTONIC), which is per-machine. When correlating events across nodes, clock differences matter. The fleet client runs NTP-style offset estimation in parallel with the actual query, taking 3 samples per node and applying a median filter. On a typical LAN with sub-millisecond RTT, precision should be well under 10ms. If skew exceeds a threshold, it warns. This adds zero latency since it runs concurrently with the data query.

Offline merge covers air-gapped environments. Some production GPU clusters have no internal HTTP connectivity between nodes. SCP the databases, merge locally, investigate.
The merge path also serves as a permanent record of the cluster state at investigation time.

MCP: AI-Driven Fleet Investigation

The fleet is also accessible through Ingero’s MCP server via the query_fleet tool. Here’s what the raw tool output looks like for a chains query across the same 4-node cluster:

Python

query_fleet(action="chains", since="5m")

Fleet Chains: 2 chain(s)

[HIGH] gpu-node-01 | cuLaunchKernel p99=843us (63.9x p50) | 847 sched_switch events + heavy block I/O
[MEDIUM] gpu-node-01 | cuMemAlloc p99=932us (5.0x p50) | 855 sched_switch events + heavy block I/O

That’s the complete response: an AI assistant gets this back from one tool call, with no SSH access to each node and no manual SQL. The tool supports four actions: chains (causal analysis), sql (arbitrary queries), ops (operation breakdown per node), and overview (event counts). Clock skew warnings are prepended automatically when detected.

Where This Stands

v0.9.1 is the initial step in cluster-level tracing, not the destination. What we have now works well for the reactive investigation workflow: something went wrong, we need to find out what and where. Fan-out queries, offline merge, Perfetto export: these are diagnostic tools for after the fact.

We’re actively working on cross-node correlation and straggler detection, with more updates coming soon. And since the instrumentation sits on host-level eBPF rather than vendor-specific hooks, none of this is limited to a specific GPU vendor.

The bet is that client-side fan-out scales to 50+ nodes before anything centralized is needed. When it doesn’t, the node-namespaced ID scheme and offline merge path ensure the architecture can evolve without breaking existing deployments.

We’re stress-testing the fan-out architecture against larger clusters and would welcome feedback from teams running multi-node training. Open an issue on GitHub.
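For readers curious about the clock-skew measurement mentioned under "Why We Built It This Way", the following is a minimal sketch of NTP-style offset estimation with a median filter over three samples. The article only states "NTP-style, 3 samples per node, median filter"; the four-timestamp formula, the function names, and the use of the 10ms figure as a warning threshold are assumptions for illustration, not Ingero's actual code.

```python
from statistics import median


def ntp_offset(t0, t1, t2, t3):
    """Classic NTP offset estimate, all timestamps in nanoseconds.

    t0: client send time (client clock)
    t1: server receive time (server clock)
    t2: server send time (server clock)
    t3: client receive time (client clock)
    A positive result means the server clock is ahead of the client clock.
    """
    return ((t1 - t0) + (t2 - t3)) / 2.0


def estimate_offset(samples):
    """Median of the per-sample offsets; the median discards a single sample
    distorted by an asymmetric network delay."""
    return median(ntp_offset(*s) for s in samples)


def skew_exceeds(offset_ns, threshold_ns=10_000_000):
    """True when the absolute offset is above the (assumed) 10ms threshold."""
    return abs(offset_ns) > threshold_ns


# Server clock is 5ms ahead. Two samples have a symmetric 0.2ms one-way
# delay; the third has a slow outbound leg (2ms), which biases its estimate.
samples = [
    (0, 5_200_000, 5_200_000, 400_000),
    (1_000_000, 6_200_000, 6_200_000, 1_400_000),
    (2_000_000, 9_000_000, 9_000_000, 4_200_000),
]
offset = estimate_offset(samples)
print(offset, skew_exceeds(offset))
```

The formula assumes the outbound and return delays are equal; when they are not, the error is bounded by half the round-trip time, which is why a sub-millisecond LAN RTT keeps the estimate well inside the 10ms figure quoted above.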
The investigations/ directory has ready-to-query databases for trying this without a GPU cluster:

• sample-gpu-node-01.db, sample-gpu-node-02.db, sample-gpu-node-03.db: individual node traces from a 3-node cluster
• sample-cluster.db: all three merged into one (600 events, 6 chains, 9 stacks)

GitHub (give us a star!): github.com/ingero-io/ingero. No NVIDIA SDK, no code changes, production-safe by design.

If you are facing distributed training issues in your own workloads, we’d love to take a look. Drop an issue on GitHub, and we will gladly dive into it together.

Ingero is free & open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, <2% overhead.

Related Reading

• GPU incident response in 60 seconds with eBPF: single-node investigation workflow that the fleet feature extends
• 11-second time to first token on a healthy vLLM server: kernel-level scheduling contention causing hidden latency, similar to the straggler root cause in this post
• GPU showing 97% utilization while training runs 3x slower: why nvidia-smi metrics alone miss the real story

By Ingero Team
AWS Managed Database Observability: Monitoring DynamoDB, ElastiCache, and Redshift Beyond CloudWatch

A DynamoDB throttle alarm fires at 2 am. You confirm the spike in CloudWatch, then check ElastiCache in a second dashboard, then Redshift in a third. Cache hit rate dropped, which hammered DynamoDB, which stalled the zero-ETL export. Three services, three dashboards, one cascade you can only trace by hand.

This guide maps the specific metrics, alarm thresholds, and configuration steps for each service, and then addresses the observability delta that CloudWatch leaves unresolved: cross-service correlation, root-cause traceability, and the capacity-planning intelligence that prevents cascades in the first place.

What CloudWatch Gives You Across DynamoDB, ElastiCache, and Redshift

Prerequisites: The CLI examples and alarm configurations in this guide assume AWS CLI v2, an IAM principal with cloudwatch:GetMetricData, cloudwatch:PutMetricAlarm, and dynamodb:UpdateContributorInsights permissions, and active DynamoDB tables, ElastiCache clusters, or Redshift clusters in your account.

CloudWatch publishes metrics for all three services under service-specific namespaces. Per the AWS CloudWatch documentation, metric retention runs in three tiers: 1-minute data points retained for 15 days, 5-minute data points for 63 days, and 1-hour data points for 455 days.

Namespace | Category | Key Metrics
AWS/DynamoDB | Capacity | ConsumedReadCapacityUnits, ConsumedWriteCapacityUnits, ThrottledRequests
AWS/DynamoDB | Latency | SuccessfulRequestLatency (p50, p99)
AWS/DynamoDB | Health | SystemErrors
AWS/ElastiCache | Efficiency | CacheHitRate, Evictions
AWS/ElastiCache | Memory | DatabaseMemoryUsagePercentage
AWS/ElastiCache | Connections | CurrConnections, ReplicationLag
AWS/Redshift | Performance | QueryDuration, QueryQueueTime
AWS/Redshift | Workload | WLMQueueLength (per queue)
AWS/Redshift | Resources | CPUUtilization, ReadIOPS, WriteIOPS

For most post-incident investigations, you’ll hit the granularity boundary within two weeks.
A throttle spike that lasted 4 minutes on day 17 shows up as a single 5-minute average data point, frequently indistinguishable from normal traffic variation. The per-custom-metric cost also compounds at scale: an account running 40 DynamoDB tables, 6 ElastiCache clusters, and 3 Redshift clusters with per-resource custom alarms can accumulate hundreds of CloudWatch metrics across namespaces, each costing $0.30/month to store and $0.10/alarm/month to evaluate.

Each namespace provides enough signal to diagnose its own service, but CloudWatch publishes no native cross-service correlation mechanism. A ThrottledRequests spike in AWS/DynamoDB and a CacheHitRate collapse in AWS/ElastiCache at the same timestamp are both visible, but connecting them as cause and effect requires a human to match timestamps across dashboards.

DynamoDB: Throttling Detection, Partition Health, and Capacity Mode Decisions

DynamoDB throttling is rarely a single-metric problem. A throttle alarm tells you capacity was exceeded, but not whether the cause is a hot partition, an undersized provisioned table, or a traffic pattern that outgrew your capacity mode. The subsections below work through that diagnostic sequence: the metrics that surface the symptom, the tooling that pinpoints the partition, and the capacity decision that prevents recurrence.

Core Metrics and Alarm Thresholds

The DynamoDB CloudWatch metric namespace publishes table-level aggregates.
For provisioned-capacity tables, these five metrics drive operational decisions:

Metric | Unit | Recommended Alarm Threshold | Notes
ThrottledRequests | Count | > 0 (provisioned mode) | Any throttling on a provisioned table means capacity is misconfigured or a hot partition is concentrating load
SuccessfulRequestLatency p99 | Milliseconds | > 10ms (read-heavy workloads); > 20ms (mixed) | p99 > 10ms on reads is a practitioner-recommended leading indicator of partition pressure before throttles appear
ConsumedReadCapacityUnits | Count/second | > 80% of provisioned RCUs | Signals you’re approaching throttle territory
ConsumedWriteCapacityUnits | Count/second | > 80% of provisioned WCUs | Same logic for write-heavy workloads
SystemErrors | Count | > 0 | Indicates DynamoDB service-side failures, distinct from capacity limits

Practitioner-recommended starting points. Tune to your workload characteristics.

ThrottledRequests at table level confirms that throttling happened, but tells you nothing about which partition caused it. On a table with millions of items, a single access pattern (a user ID acting as a partition key hot spot, for instance) can drive 95% of throttles while aggregate consumed capacity looks healthy. DynamoDB Contributor Insights resolves this.

Contributor Insights for Hot Partition Detection

DynamoDB Contributor Insights surfaces the top-N most-accessed partition keys and sort keys in real time. It identifies the specific items driving throttling or high latency that pure CloudWatch metric aggregation can’t surface. Enabling it on a production table with significant traffic incurs cost (priced per request evaluated), but during a throttle incident, Contributor Insights gives you the specific key value generating excess load rather than an aggregate curve.
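Conceptually, a top-N contributor ranking is a frequency count over per-request records. The sketch below is only an illustration of that idea, not how Contributor Insights is implemented or queried; the record shape `(partition_key, throttled)` and the function name are hypothetical.

```python
from collections import Counter


def top_throttled_keys(access_log, n=10):
    """Rank partition keys by throttled-request count.

    `access_log` is an iterable of (partition_key, throttled) tuples.
    Returns a list of (key, throttle_count, share_of_all_throttles).
    """
    throttles = Counter(key for key, throttled in access_log if throttled)
    total = sum(throttles.values())
    if total == 0:
        return []
    return [(key, count, count / total)
            for key, count in throttles.most_common(n)]


# One hot key produces 95 of 100 throttles while most traffic is healthy.
log = (
    [("user#42", True)] * 95
    + [("user#7", True)] * 5
    + [("user#9", False)] * 900
)
for key, count, share in top_throttled_keys(log, n=3):
    print(key, count, f"{share:.0%}")
```

In this made-up log only 10% of requests are throttled, which a table-level aggregate could easily present as mild pressure, yet the per-key view shows a single key behind 95% of the throttles.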
Enable it from the DynamoDB console under the table’s “Monitor” tab, or via CLI (requires AWS CLI v2+):

Plain Text

aws dynamodb update-contributor-insights \
    --table-name YOUR_TABLE_NAME \
    --contributor-insights-action ENABLE

Once active, CloudWatch Logs Insights receives partition-level data within minutes. Query the top-10 most-accessed partition keys over the past hour to confirm whether a hot key is generating the throttle alarm:

Plain Text

filter @message like /ContributorInsights/
| stats count(*) as accessCount by partitionKey
| sort accessCount desc
| limit 10

Capacity Mode Decision Logic

The decision between provisioned and on-demand capacity modes depends on traffic predictability. Use a 7-day trend of consumed capacity (ConsumedReadCapacityUnits and ConsumedWriteCapacityUnits) as your input signal:

• If consumed capacity stays below 80% of provisioned capacity and follows a consistent daily pattern, stay on provisioned. Set auto-scaling target utilization at 70% of provisioned capacity to leave headroom for traffic spikes before throttling begins.
• If consumed capacity regularly exceeds 80% of provisioned, or if usage patterns show irregular spikes with no predictable shape, on-demand mode eliminates throttling risk at a higher per-request cost.

Teams running the DynamoDB zero-ETL integration with Redshift (GA October 2024) face a different monitoring angle from streaming replication. The integration operates via periodic incremental exports every 15 to 30 minutes, so source table latency doesn’t affect export timing. The primary constraint on analytics data freshness is export completion status, visible in the Redshift console under the integration view. Export failures are the leading indicator of stale analytics data.

ElastiCache: Cache Efficiency, Memory Pressure, and the Valkey 8.0 Observability Upgrade

When cache hit rate drops, the blast radius extends beyond ElastiCache.
Every cache miss becomes a direct read against your origin datastore, and if that origin is a DynamoDB table already running near provisioned capacity, you get the throttle cascade from the introduction. The metrics below separate cache-level symptoms from the memory and replication signals that predict them, followed by the observability improvements Valkey 8.0 brings.

Redis and Valkey Metrics

Per the ElastiCache CloudWatch documentation, the metrics that drive operational decisions for Redis and Valkey deployments are:

Metric | Target | Alert Threshold | Action
CacheHitRate | >= 0.95 | < 0.90 | Investigate at < 0.90; below 0.80 indicates a significant access pattern change or deployment that altered cache key patterns
Evictions | ~0 (steady state) | > 100/min sustained | Sustained evictions mean maxmemory-policy is evicting live data under memory pressure
DatabaseMemoryUsagePercentage | < 70% | Alert at > 75%; scale-out at > 85% | Alert at 75% gives runway to analyze dataset growth; above 85% triggers automatic evictions under most policies
ReplicationLag | < 100ms | > 500ms | Replica lag at this level affects read scaling reliability
CurrConnections | Workload-specific | > 80% of max allowed | Persistent near-limit connections indicate a connection pool misconfiguration or application-side leak

Practitioner-recommended starting points based on operational experience.

Memcached deployments within ElastiCache expose a different metric set through the same AWS/ElastiCache namespace: get_hits and get_misses (from which you derive hit rate), evictions, and bytes_used vs. limit_maxbytes. Valkey and Redis are cluster-based architectures with native replication, while Memcached is a horizontally partitioned cache with no native replication. Applying Redis/Valkey thresholds to Memcached deployments produces misleading alarms.

Valkey 8.0 Observability Additions

The open-source Valkey 8.0 release shipped from the Linux Foundation on September 16, 2024.
Amazon ElastiCache 8.0 for Valkey launched on November 21, 2024, bringing four observability primitives that prior Redis OSS metrics on ElastiCache didn’t expose.

Per-slot metrics let you identify which hash slots carry disproportionate traffic across a cluster. Before Valkey 8.0, CloudWatch surfaced per-node and per-cluster aggregates only. A slot-level throughput imbalance (common after a key pattern change in the application layer) was invisible until it produced node-level CPU or memory pressure. With per-slot metrics, you detect the asymmetry before it cascades to node-level saturation.

Per-client event loop latency tracks how long each client connection waits in the event loop queue. This directly diagnoses client-specific throughput asymmetries. If one application service has a misconfigured connection pool producing tail latency that appears as a CacheHitRate degradation from another service’s perspective, per-client event loop latency identifies the offending client specifically rather than surfacing a cluster-level aggregate that implicates everything.

Rehash memory tracking quantifies the temporary memory overhead during cluster rescaling. When you add nodes to an ElastiCache Valkey cluster, the rehashing process requires holding two copies of some hash-slot data in memory simultaneously. Before this metric, a DatabaseMemoryUsagePercentage spike during a scale-out event was ambiguous. With rehash memory tracking, you can confirm the spike is transient rehash overhead and dismiss the alarm as expected behavior rather than a capacity problem.

Traffic breakdowns split read, write, and key expiry operations at the slot and node level. This replaces the single-dimensional throughput view that prior ElastiCache Redis metrics provided and enables you to identify whether a throughput increase is driven by reads, writes, or expiry churn without writing custom instrumentation.

Valkey 8.1, released April 2, 2025, adds further observability improvements.
Verify ElastiCache 8.1 availability in your region at the time of deployment, as managed service version availability can trail the open-source release by several weeks.

Redshift: Query Performance, WLM Configuration, and Enhanced Monitoring

Redshift performance problems tend to look identical from the outside: queries slow down. Whether the cause is CPU saturation, WLM slot exhaustion, or a bad query plan requires different metrics and different responses. The thresholds below separate those conditions, followed by the Enhanced Query Monitoring tooling that replaced the manual system-table workflow for root-cause diagnosis.

Key CloudWatch Metrics and WLM Thresholds

Metric | Recommended Threshold | Action
CPUUtilization | Alert at > 80% | Investigate active query plans if sustained; evaluate concurrency scaling if combined with queue depth
WLMQueueLength (per queue) | Alert at > 3; escalate at > 5 sustained for 60 seconds | WLMQueueLength > 5 sustained over 60 seconds combined with CPUUtilization > 85% is a practitioner-recommended trigger for enabling a Redshift concurrency scaling cluster
QueryQueueTime | > 30 seconds | Queries waiting over 30 seconds indicate WLM queue saturation or slot misconfiguration
QueryDuration | 2x the 7-day p95 baseline for that WLM queue | Baseline drift detection for workload-specific thresholds
ReadIOPS | Cluster baseline | Sharp ReadIOPS spikes without a corresponding query load increase can indicate full-table scans or missing sort key filters

The WLMQueueLength threshold requires context to interpret correctly. A WLMQueueLength of 5 on a queue allocated 5 concurrency slots means every slot is occupied and the queue is at capacity. Combined with CPUUtilization above 85%, adding concurrency scaling capacity is the right response. WLMQueueLength of 5 with CPUUtilization at 40% points to a slot allocation problem: queries are queuing behind slot limits rather than behind compute saturation, and the fix is WLM reconfiguration, not additional nodes.
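The interpretation rule in the preceding paragraph can be written down as a small decision function, which may be handy as a runbook helper. This is a sketch under assumptions: the function name and return labels are hypothetical, and the text only describes the 85%-plus and 40% CPU cases, so treating everything at or below 85% as a slot-allocation problem is a simplification made here rather than guidance from the article.

```python
def diagnose_wlm(queue_length, concurrency_slots, cpu_utilization_pct,
                 cpu_threshold_pct=85.0):
    """Classify a WLM queue reading.

    Returns one of:
      "ok"                      - the queue is not at its slot capacity
      "add_concurrency_scaling" - slots full and compute is saturated
      "reconfigure_wlm"         - slots full but CPU has headroom
    """
    if queue_length < concurrency_slots:
        return "ok"
    if cpu_utilization_pct > cpu_threshold_pct:
        return "add_concurrency_scaling"
    return "reconfigure_wlm"


# The two scenarios described above, plus a healthy reading.
print(diagnose_wlm(5, 5, 90.0))
print(diagnose_wlm(5, 5, 40.0))
print(diagnose_wlm(2, 5, 90.0))
```

In practice the inputs would come from the WLMQueueLength and CPUUtilization metrics over the same sustained window, so that a momentary spike in one does not trigger a scaling decision.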
Historically, diagnosing slow Redshift queries required direct access to system tables. A typical workflow queried STL_QUERY for execution times, joined to SVL_QUERY_METRICS for resource usage per execution step, and cross-referenced SVL_QUERY_SUMMARY for operator-level plan details. This three-step workflow required SQL client access, familiarity with the Redshift internal catalog schema, and significant manual correlation work. Redshift Enhanced Query Monitoring Redshift Enhanced Query Monitoring went GA on January 29, 2025, available for both Serverless and provisioned deployments. It surfaces query bottlenecks, execution plan anomalies, and resource contention at the query level through the Redshift console, removing the need for SQL-level diagnostic work against system tables. When WLMQueueLength spikes, you can go directly to a ranked list of the queries causing saturation, see their execution plan highlights, and identify whether the bottleneck is a sort key miss, a cross-join, or a network shuffle between nodes, all without writing a single STL_QUERY lookup. Redshift troubleshooting previously required a senior engineer with DBA-level knowledge of the system catalog. This change shifts basic performance diagnosis to any SRE comfortable with the console. AI-Driven Scaling and Its Monitoring Implications AWS previewed Redshift Serverless AI-driven scaling at re:Invent 2023, and it went GA in October 2024. Verify current GA status in the AWS documentation for your region before production adoption, as the preview-to-GA timeline varies by feature and region. AI-driven scaling automates RPU (Redshift Processing Unit) allocation by observing query patterns over time and adjusting base and max RPU settings to balance cost against performance. WLM queue priority, query monitoring rule configuration, and workload classification for mixed BI and ETL environments require manual configuration even on Serverless clusters running AI-driven scaling. 
A Redshift Serverless cluster with AI-driven scaling still requires you to define how ETL jobs and ad hoc analyst queries share resources, and which queue takes priority when both arrive simultaneously. Those decisions drive WLMQueueLength behavior regardless of how accurately the scaler provisions RPUs.

Capacity Planning: Using Monitoring Data to Drive Scaling and Cost Decisions

The cross-service capacity heuristic worth building into your runbooks: a simultaneous DynamoDB p99 latency increase combined with ElastiCache CacheHitRate dropping below 0.90 can indicate several different conditions. Potential causes include a fan-out query change at the application layer, a cache node failure, a network event between services, or a deployment that altered cache key patterns. This symptom combination warrants application-layer investigation to confirm the root cause before deciding which service to scale. Scaling either service without confirming the shared trigger wastes capacity and can mask the actual issue.

DynamoDB

Build a 7-day average of ConsumedReadCapacityUnits and ConsumedWriteCapacityUnits as your baseline, then set auto-scaling target utilization at 70% of provisioned capacity. Auto-scaling starts adding capacity once consumption rises above that target, and the remaining 30% of provisioned capacity is the buffer: consumed traffic can grow by roughly 43% (100/70) before it reaches 100% of provisioned capacity and throttling begins. When evaluating reserved capacity, AWS Cost Explorer surfaces DynamoDB reserved capacity recommendations with projected savings. At a 3-year term commitment, reserved capacity can save up to 77% versus provisioned capacity hourly rates. Reserved capacity makes financial sense for tables that have run in provisioned mode for at least 90 days with predictable consumption patterns. For tables with volatile or seasonal traffic, on-demand mode avoids the risk of underutilization that makes reserved capacity economically counterproductive.

ElastiCache

Trend DatabaseMemoryUsagePercentage over a 72-hour window.
If it trends upward at a rate disconnected from traffic growth (the cache dataset is growing while the request rate stays flat), that signals cache dataset expansion rather than increased load. The operational response is node scaling before you cross the 75% alert threshold, as memory pressure at that level narrows your runway to eviction-level problems. For ElastiCache Serverless using Valkey, monitor ElastiCacheProcessingUnits (ECPUs) as the scaling proxy. ECPU consumption scales with operation complexity and data volume, making it the primary cost and capacity signal for Serverless deployments where node count decisions don’t apply. Redshift Correlate CPUUtilization with QueryQueueTime over a 1-week window. The CPU-vs-queue diagnostic from the Redshift metrics section applies here as your scaling decision input: high CPU points to node scaling, while high queue time with moderate CPU points to WLM slot reconfiguration. Where CloudWatch’s Coverage Falls Short The per-service metrics and tooling above give you solid visibility within each namespace. The gaps show up when you need to work across them: correlating alarms from different services, connecting logs to metrics, and suppressing the noise when a single event triggers alerts everywhere at once. No Native Cross-Service Correlation You can build a CloudWatch dashboard that co-locates DynamoDB ThrottledRequests, ElastiCache Evictions, and Redshift WLMQueueLength on a shared timeline, but it’s manual widget assembly with no causal linking between the graphs. The assembly is also fragile: every new table, cluster, or queue requires manual dashboard updates to keep the view current. Log-to-Metric Correlation Is Manual Connecting a slow Redshift query logged in STL_QUERY to a spike in DynamoDB SuccessfulRequestLatency at the same timestamp requires opening CloudWatch Logs Insights for Redshift audit logs, querying by timestamp range, then manually comparing results against the DynamoDB metric timeline. 
The Enhanced Query Monitoring GA from January 2025 reduces this friction for Redshift-internal diagnosis, but the cross-service correlation step remains a human task. Cross-Account Visibility CloudWatch Database Insights added cross-account and cross-region support for database fleet monitoring on November 21, 2025. Verify the current scope of service coverage at the time of your deployment, as the announcement references database fleet monitoring broadly, and the specific inclusion of ElastiCache and Redshift alongside RDS and Aurora should be confirmed against current documentation. Alert Fatigue Across Three Namespaces Each service generates its own alarm stream with no dependency-aware suppression between services. When a single network event causes DynamoDB latency to rise, ElastiCache hit rate to drop, and Redshift WLM queue depth to increase, CloudWatch fires alarms across three separate notification channels simultaneously. The on-call engineer receives three alerts for a single root cause event, with no automated path from any alarm to the triggering condition. ManageEngine OpManager Nexus addresses these gaps directly: it auto-discovers DynamoDB tables, ElastiCache clusters, and Redshift clusters within your AWS account, builds correlated dashboards that connect metrics across all three services on a shared timeline without manual widget assembly, and applies dependency-aware alarm suppression that treats downstream symptoms of a single event as a grouped incident. For teams running two or more of these managed database services, the operational delta between nine isolated CloudWatch alarms and a correlated, root-cause-linked view determines where monitoring hours get spent or recovered. Your Monitoring Baseline: Nine Alarms and a Unified View The minimum viable monitoring baseline for all three services is nine CloudWatch alarms routed to a single SNS topic. These are practitioner-recommended starting points. 
Tune each threshold to your observed workload behavior.

DynamoDB Alarms

Alarm Name | Metric | Threshold | Evaluation Period
DynamoDB-Throttles | ThrottledRequests | > 0 | 1 minute
DynamoDB-LatencyP99 | SuccessfulRequestLatency (p99) | > 20ms | 5 minutes
DynamoDB-RCUHigh | ConsumedReadCapacityUnits | > 80% of provisioned | 5 minutes

Metric definitions: DynamoDB CloudWatch metrics reference.

ElastiCache Alarms

Alarm Name | Metric | Threshold | Evaluation Period
Cache-HitRateLow | CacheHitRate | < 0.90 | 5 minutes
Cache-EvictionsHigh | Evictions | > 100 per minute | 1 minute
Cache-MemoryHigh | DatabaseMemoryUsagePercentage | > 75% | 5 minutes

Metric definitions: ElastiCache CloudWatch metrics reference.

Redshift Alarms

Alarm Name | Metric | Threshold | Evaluation Period
Redshift-CPUHigh | CPUUtilization | > 80% | 5 minutes
Redshift-QueueDepth | WLMQueueLength | > 3 | 5 minutes
Redshift-QueueWait | QueryQueueTime | > 30 seconds | 5 minutes

Metric definitions: Redshift CloudWatch metrics reference.

Route all nine alarms to a single SNS topic. Tag each alarm with a Service dimension (values: DynamoDB, ElastiCache, Redshift) so your incident management tooling can filter and group by service. This configuration puts all three alarm streams in one place and makes it detectable when multiple service alarms fire within a short time window, which is the observable signature of a cross-service cascade.

Run these nine alarms for a week or two. You’ll see the pattern: multiple alarms firing within the same minute window for what turns out to be a single root cause, with no automated way to connect them. That delta is what a correlated observability layer closes. ManageEngine OpManager Nexus provides that layer for AWS database services, with auto-discovery, cross-service dashboards, and dependency-aware alarm suppression out of the box.

What’s your current setup for correlating alarms across managed AWS services? If you’re running DynamoDB, ElastiCache, or Redshift and have found thresholds or approaches that work well for your team, share them in the comments.
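The "multiple alarms within the same minute window" signature described above can be detected with a few lines of code once all nine alarms land in one stream. The sketch below is illustrative only: it assumes alarm notifications have already been parsed into `(timestamp, service, alarm_name)` tuples (a hypothetical shape, not the SNS message format), and `group_cascades` is a made-up helper name.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

AlarmEvent = Tuple[datetime, str, str]  # (fired_at, service, alarm_name)


def group_cascades(events: List[AlarmEvent],
                   window: timedelta = timedelta(minutes=1)) -> List[List[AlarmEvent]]:
    """Group alarms that fired within `window` of the first alarm in a group,
    and keep only groups that span at least two services."""
    groups: List[List[AlarmEvent]] = []
    current: List[AlarmEvent] = []
    for event in sorted(events, key=lambda e: e[0]):
        # Start a new group once the window since the group's first alarm has passed.
        if current and event[0] - current[0][0] > window:
            groups.append(current)
            current = []
        current.append(event)
    if current:
        groups.append(current)
    # A single-service group is an ordinary alarm, not a cross-service cascade.
    return [g for g in groups if len({service for _, service, _ in g}) >= 2]


if __name__ == "__main__":
    t0 = datetime(2025, 1, 1, 12, 0, 0)
    sample = [
        (t0, "DynamoDB", "DynamoDB-LatencyP99"),
        (t0 + timedelta(seconds=20), "ElastiCache", "Cache-HitRateLow"),
        (t0 + timedelta(seconds=40), "Redshift", "Redshift-QueueDepth"),
        (t0 + timedelta(minutes=10), "Redshift", "Redshift-CPUHigh"),
    ]
    for group in group_cascades(sample):
        print([name for _, _, name in group])
```

This only groups by time; it does not establish causality, which is the gap the article attributes to CloudWatch in the first place.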

By Damaso Sanoja
Architecting Petabyte-Scale Hyperspectral Pipelines on AWS

The Data Challenge Every industry has its version of the same data engineering problem: massive, complex payloads generated at the edge — far from the cloud, often on unreliable networks — that need to become queryable, structured datasets as fast as possible. In genomics, it is multi-gigabyte sequencing files produced by instruments in labs. In autonomous vehicles, it is LiDAR and camera telemetry streaming off test fleets. The underlying architectural challenge is the same in every case: ingest heavy data at burst scale, store it cost-effectively for years, and transform it into something an analyst or ML model can actually use without touching the raw files. This article uses hyperspectral imaging in digital agriculture as the concrete use case, but the architecture is designed to be general-purpose and replicable. Hyperspectral sensors capture light across hundreds of spectral bands, making it possible to detect water stress, nutrient deficiencies, and early disease in crops well before anything is visible to the human eye. A single sensor pass over a 160-acre field generates 40–80 GB of raw data. These are not images in any conventional sense — they are three-dimensional tensors, often called “hypercubes,” where every spatial pixel carries reflectance measurements across 200 or more contiguous spectral bands. The files arrive in scientific formats like HDF5, NetCDF, or ENVI, which do not support partial reads over a network without specialized tooling. Loading an entire 4 GB cube into memory just to extract a vegetation index from three bands is wasteful at the small scale and operationally unaffordable once a mid-size operation is producing 5–10 TB of raw cubes per growing season. The architecture described here solves that problem end to end: from raw sensor capture to queryable, structured tables in the cloud with cost-efficient storage and minimal dependency on network bandwidth. 
The patterns (event-driven ingestion, aggressive storage tiering, medallion lakehouse design, and containerized edge processing) are all portable. Swap the hyperspectral cube in this architecture for a FASTQ file or a LiDAR point cloud, and the same blueprint applies with minimal modifications.

Ingestion: Handling Seasonal Burst Traffic

Agricultural data arrives in extreme seasonal bursts. During harvest, hundreds of edge nodes may be uploading simultaneously; in winter, the pipeline sits nearly idle. Any architecture that provisions fixed compute for this pattern will be inefficient, so the ingestion layer needs to scale out for the bursts and back down to near zero when idle.

The pipeline uses an S3 → SQS → Lambda → Batch pattern, and the SQS queue in the middle is what makes the rest of it work. When files land in S3, event notifications route into the queue, which acts as a buffer between the unpredictable arrival rate and the compute layer downstream. Lightweight Lambda functions, acting like an air traffic controller, poll the queue, bundle incoming file references into manifest batches of 50–200 cubes, and submit those manifests to AWS Batch. Batch spins up Spot Instances to do the actual heavy processing.

Triggering Lambda directly from S3 events was the first approach, but it breaks down at scale for two reasons: Lambda’s concurrency limits create a hard ceiling during burst ingest, causing silent throttling and dropped events, and the 1:1 mapping between files and Lambda invocations is inefficient when the processing works much better against batches of files. Putting SQS in the middle solves both problems at once.

For the compute environment, AWS Batch won out over the alternatives after some evaluation. The main limitation of Fargate was its hard memory ceiling of around 30 GB. This was simply too tight for processing a 4 GB data cube with intermediate arrays in memory that can easily require 32–64 GB of RAM.
Batch also provides native handling for job queuing, retries, and Spot interruption recovery. Since the workload is highly parallel and interruption-tolerant, this capability allowed us to safely leverage Spot pricing, delivering a significant 60–90% cost reduction that would have been difficult to justify passing up. One early lesson involved S3 prefix design. A flat raw/ prefix structure ran into per-prefix request rate limits (3,500 PUTs/second) during burst ingest, which caused throttling that was initially difficult to diagnose. Restructuring to region/farm_id/year/month/day/ spread the writes across thousands of unique prefixes and also aligned neatly with the partition scheme used by Athena and Trino downstream, so the same naming convention solved both the throughput problem and the query performance problem. Storage: Managing Petabyte-Scale Costs At this scale, storage costs will quietly become the largest line item in the project if the tiering strategy is not aggressive from day one. Petabytes of data at $0.023/GB/month in S3 Standard add up fast, but deleting raw scientific data is not an option due to regulatory reasons and for future model improvements. The lifecycle strategy moves successfully processed cubes to Glacier Instant Retrieval within 24 hours. The initial instinct was to go straight to Deep Archive, but in practice, about 5–8% of cubes get retrieved within the first year—sensor calibrations get updated, new vegetation index algorithms need validation against historical data, and so on. Deep Archive’s 12-hour restoration time makes that retrieval workflow painful enough to slow down the R&D cycle. Glacier IR runs at roughly $0.004/GB/month, about 6x cheaper than Standard, with millisecond retrieval. After a year, once retrieval rates drop below 1%, a second lifecycle rule transitions everything to Deep Archive. 
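The tiering arithmetic above is easy to sanity-check in code. The sketch below uses the two per-GB prices quoted in the text (S3 Standard and Glacier Instant Retrieval); the Deep Archive price is an assumption added for illustration and is not from the article, and all prices vary by region. Retrieval, request, and minimum-duration charges are deliberately ignored. `PRICE_PER_GB_MONTH` and `monthly_cost` are hypothetical names.

```python
# Storage price per GB-month in USD. STANDARD and GLACIER_IR are the figures
# quoted in the text; DEEP_ARCHIVE is an assumed value for illustration only.
PRICE_PER_GB_MONTH = {
    "STANDARD": 0.023,
    "GLACIER_IR": 0.004,
    "DEEP_ARCHIVE": 0.00099,  # assumption, check current regional pricing
}


def monthly_cost(size_gb: float, storage_class: str) -> float:
    """Storage-only monthly cost; ignores retrieval and request fees."""
    return size_gb * PRICE_PER_GB_MONTH[storage_class]


if __name__ == "__main__":
    one_petabyte_gb = 1_000_000  # decimal petabyte
    for storage_class in PRICE_PER_GB_MONTH:
        cost = monthly_cost(one_petabyte_gb, storage_class)
        print(f"{storage_class}: ${cost:,.0f} per month")
    ratio = PRICE_PER_GB_MONTH["STANDARD"] / PRICE_PER_GB_MONTH["GLACIER_IR"]
    print(f"Standard vs Glacier IR: {ratio:.2f}x")
```

The ratio works out to 5.75, which is where the "about 6x cheaper" figure comes from.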
The important detail in the lifecycle configuration is a tag-based filter that gates the transition on processing_status = complete. Without this check, cubes that failed processing end up in Glacier, and restoring them for a retry becomes an unnecessary expense that multiplies quickly during periods of high ingest.

# Terraform: Tiered lifecycle for raw HSI cubes
resource "aws_s3_bucket_lifecycle_configuration" "hsi_raw" {
  bucket = aws_s3_bucket.raw_hsi_data.id

  rule {
    id     = "raw_cubes_to_cold_storage"
    status = "Enabled"

    # Only objects under raw_cubes/ that are tagged as fully processed qualify
    filter {
      and {
        prefix = "raw_cubes/"
        tags = {
          processing_status = "complete"
        }
      }
    }

    # Move to Glacier Instant Retrieval one day after creation
    transition {
      days          = 1
      storage_class = "GLACIER_IR"
    }

    # Move to Deep Archive after one year
    transition {
      days          = 365
      storage_class = "DEEP_ARCHIVE"
    }
  }
}

The Lakehouse: From Cubes to Queryable Tables

Everything upstream exists to feed this layer. The goal is to get the R&D team off the cycle of downloading, unzipping, and parsing multi-gigabyte cubes every time they need to calculate a vegetation index or train a model. The lakehouse is built on a medallion pattern using Apache Iceberg, organized around an extract-once, query-many principle.

Iceberg was chosen over plain Parquet files on S3 with a Glue Catalog because three problems kept recurring during development. First, schema evolution: new sensors arrive with different band configurations, and Iceberg handles column additions without rewriting historical data. Second, time travel: when a calibration error is discovered, rolling the Silver table back to a previous snapshot is a straightforward operation rather than a data recovery project. Third, hidden partitioning: Iceberg derives partition values from column data at write time, which means queries on acquisition_date get automatic partition pruning.

Medallion Layers

Bronze (Standardized Cubes)

Calibrated for sensor noise and atmospheric interference, stored in cloud-optimized format (Zarr or COG), retaining the full 3D spectral structure.
This layer serves as the reproducible starting point for all downstream processing — if an algorithm changes six months later, reprocessing starts from Bronze rather than from the raw archive sitting in Glacier. Silver (Structured Reflectance) The 3D tensors are flattened into Iceberg tables where each row represents a spatial coordinate, and each column holds a band’s reflectance value, partitioned by farm_id and acquisition_date. The Bronze-to-Silver transformation is the most compute-intensive step in the pipeline. Gold (Business-Ready Metrics) Pre-computed agricultural indices — NDVI, NDWI, chlorophyll estimates — aggregated by crop, field row, and time period. These are the tables that dashboards query, that yield prediction models train on, and that agronomists use to make irrigation and fertilization decisions. With data in this shape, Trino handles federated SQL across the Silver and Gold tables for ad-hoc analysis, and ML training pipelines read directly from Silver without any file wrangling. The most valuable analytical work comes from joining Gold-layer crop health metrics with non-spectral datasets across the organization, and those cross-domain joins are where insights about field-level yield variation actually emerge, which is something no single dataset can surface on its own. From Pixels to Decisions: Automating the Breeding Pipeline To make this pipeline actually valuable to the business, this has to go beyond just calculating a vegetation index. The Gold layer is where pixels turn into decisions. For example, in crop breeding programs, teams test thousands of seed varieties across different microclimates to see which ones survive drought or resist disease. Agronomists do not have time to look at thousands of heatmaps; they need automated, binary outcomes. 
By joining the structured hyperspectral data in the Gold tables with field boundaries and historical yield databases, the system applies predefined business logic to automatically flag which genetic lines are failing. This generates concrete "Advance" or "Discard" recommendations for the breeding pipeline. At this stage, the data stops being a scientific image and becomes a direct, automated trigger for the next planting cycle. Edge Deployment: Processing at the Source The bandwidth at some of these remote locations makes a cloud-only approach unrealistic. A 4 GB cube over a 50 Mbps rural LTE connection takes over 10 minutes under ideal conditions, and rural LTE rarely delivers ideal conditions. Multiply that by dozens of passes per day during peak season, and the uplink becomes the dominant bottleneck in the entire system. The first round of processing has to happen on the equipment itself. One Container, Two Targets For managing the single OCI-compliant processing container at the edge, both AWS IoT Greengrass and K3s were considered. While Greengrass provides tight, convenience-focused AWS integration for features like device shadows, OTA updates, and managed MQTT bridging, the long-term architectural goal heavily prioritizes operational independence and portability. K3s was the pick here — it runs fully offline after bootstrap, uses standard Kubernetes manifests, and avoids locking the edge layer into a single vendor. This commitment to a lightweight, standard Kubernetes runtime avoids vendor lock-in at the crucial edge layer and provides the essential flexibility needed should a multi-cloud strategy become necessary. The edge container performs radiometric calibration and spectral flattening, producing a Parquet file that is typically 50–100x smaller than the raw cube. That compression ratio is what makes the entire edge strategy viable — the processed output is small enough to upload over cellular, while the raw cube would take orders of magnitude longer. 
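The bandwidth argument above comes down to simple arithmetic, sketched here. The helper name `upload_seconds` is hypothetical, sizes are treated as decimal gigabytes, and the link is assumed to sustain its nominal rate with no protocol overhead, which the text notes rural LTE rarely does.

```python
def upload_seconds(size_gb: float, link_mbps: float) -> float:
    """Ideal transfer time: decimal gigabytes converted to megabits, divided by link rate."""
    megabits = size_gb * 8 * 1000
    return megabits / link_mbps


if __name__ == "__main__":
    raw_cube_gb = 4.0
    link_mbps = 50.0
    raw_minutes = upload_seconds(raw_cube_gb, link_mbps) / 60
    print(f"Raw cube: {raw_minutes:.1f} minutes")
    # The text gives a 50-100x size reduction for the edge-produced Parquet file.
    for reduction in (50, 100):
        seconds = upload_seconds(raw_cube_gb / reduction, link_mbps)
        print(f"Parquet at {reduction}x smaller: {seconds:.1f} seconds")
```

Under these ideal assumptions the raw cube needs a little under 11 minutes, while the flattened Parquet output needs on the order of ten seconds, which is why only the processed file travels over cellular.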
Hardware and Sync Hyperspectral processing is dominated by dense matrix multiplications across hundreds of bands, which requires GPU hardware. The setup uses ruggedized NVIDIA Jetson AGX Orin modules mounted directly on field equipment, providing the CUDA cores needed to run CuPy-based calibration and flattening in near real-time. The sync strategy splits on payload size and urgency. Processed Parquet files stream back to the cloud in near real-time via Amazon MSK (Kafka) over an MQTT bridge, giving the lakehouse immediate telemetry. Kafka was chosen over SQS for this link because the downstream Spark Structured Streaming jobs benefit from offset-based replay semantics — if a job fails mid-batch, it resumes from the last committed offset without data loss or duplication, which is harder to guarantee cleanly with SQS visibility timeouts. The raw cubes stay on local storage and are only backhauled when the equipment returns to a facility with a high-speed connection, keeping bandwidth costs under control. Summary The core ideas behind this pipeline are straightforward: decouple storage from compute using SQS as a buffer, push the first round of processing to the edge so bandwidth stops being the bottleneck, tier storage aggressively so petabyte-scale retention stays economical, and structure everything into a medallion lakehouse so end users get SQL tables instead of binary blobs. Each piece is well-understood on its own; the value is in how they compose into an end-to-end system that stays reliable and cost-effective at scale. As noted at the outset, none of this is specific to agriculture. The hyperspectral cube is just one instance of a pattern that shows up across industries — genomics, satellite imagery, LiDAR, manufacturing inspection — wherever heavy payloads are born at the edge and need to become queryable data in the cloud. The crop science forced this architecture into existence, but the blueprint is portable. 
Swap the payload and the domain-specific transforms, and the rest of the system carries over.
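As an illustration of the ingestion step described earlier, where file references from the queue are bundled into manifests of 50–200 cubes before being handed to the batch layer, here is a minimal sketch of the bundling logic alone. It is not the author's Lambda code: `build_manifests` is a hypothetical helper, the object keys are invented examples following the region/farm_id/year/month/day/ prefix scheme, and the SQS polling and AWS Batch submission calls are intentionally left out.

```python
from typing import Iterable, List


def build_manifests(keys: Iterable[str], max_cubes: int = 200) -> List[List[str]]:
    """Split object keys into manifests of at most `max_cubes` entries each."""
    if max_cubes < 1:
        raise ValueError("max_cubes must be at least 1")
    manifests: List[List[str]] = []
    current: List[str] = []
    for key in keys:
        current.append(key)
        if len(current) == max_cubes:
            manifests.append(current)
            current = []
    if current:
        # The final manifest may hold fewer cubes than the cap.
        manifests.append(current)
    return manifests


if __name__ == "__main__":
    # Hypothetical keys using the prefix layout from the S3 design lesson.
    keys = [f"us-west/farm-001/2025/09/14/cube_{i:04d}.h5" for i in range(450)]
    for number, manifest in enumerate(build_manifests(keys), start=1):
        print(f"manifest {number}: {len(manifest)} cubes")
```

A real implementation would also need to decide what to do with a small trailing manifest (submit it immediately or wait for more files), which depends on latency requirements the article does not specify.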

By Anil Bodepudi
Why SAP S/4HANA Landscape Design Impacts Cloud TCO More Than Compute Costs

Introduction: Beyond Compute Prices

When migrating or running SAP S/4HANA on AWS, many organizations fixate on EC2 instance prices and assume that choosing the cheapest instance types will yield the biggest savings. In reality, cloud TCO is heavily impacted by landscape design choices: how many environments you run, how they’re sized, how data is managed, and what auxiliary services you use. Cutting cloud costs isn’t just about shrinking VM sizes; it’s about architecting an efficient SAP landscape. As one SAP FinOps guide notes, focusing only on instance sizing addresses symptoms, not causes. True cost optimization asks: Is the SAP landscape design efficient? Are you running unnecessary SAP instances, and can workloads consolidate onto fewer systems? In other words, a thoughtful landscape architecture often yields larger savings than a simple per-server cost reduction.

Understanding an SAP S/4HANA Landscape on AWS

A typical S/4HANA landscape consists of multiple tiers and environments. You might have separate DEV, QA, Staging, and Production systems, each a full SAP stack with its own HANA database and application servers. On AWS, that could translate to dozens of EC2 instances, along with associated storage and network infrastructure. Each additional environment or system copy multiplies costs for compute, Amazon EBS storage, Amazon EFS shared file systems, backup retention, and so on. Landscape design decisions, such as how many parallel systems to run or whether every environment needs high availability, can quickly outweigh the cost of an individual EC2 instance.

Right-Sizing Compute Resources

Right-sizing is the practice of matching instance types and sizes to actual workload needs. SAP S/4HANA is resource-intensive, so it’s critical to choose the appropriate EC2 instance families and sizes for each component. AWS offers SAP-certified instance families.
Avoid the temptation to oversize “just in case”; use monitoring tools like Amazon CloudWatch and SAP’s EarlyWatch reports to gauge real utilization. If a QA system never exceeds 30% CPU and 50% memory, you might run it on a half-sized instance compared to production. Many companies set policies such as “development instances must not exceed 50% of production capacity, and QA 70%.” This ensures non-production systems are proportionally smaller and cheaper.

In Terraform, you can parameterize instance sizes by environment to enforce right-sizing. A production vs. dev HANA server might be expressed as:

# Example Terraform: Use smaller instance type for non-production
variable "env" {
  default = "prod"
}

resource "aws_instance" "sap_hana" {
  ami           = "ami-0abcdef12345..." # SAP HANA Linux AMI
  instance_type = var.env == "prod" ? "r6i.8xlarge" : "r6i.2xlarge"
  # ... (other configuration like VPC, subnet, security groups)

  tags = {
    Name        = "${var.env}-hana"
    Environment = var.env
  }
}

In this snippet, a development environment could be launched with -var env=dev to automatically use a smaller instance, whereas production uses r6i.8xlarge. Right-sizing combined with flexible IaC lets you avoid paying for capacity you don’t need while still meeting SAP performance requirements.

Beyond instance selection, leverage cost-saving options for compute:

• Savings Plans or Reserved Instances: If your SAP workloads run 24/7 in prod, commit to a one- or three-year Savings Plan to get discounts up to 72%.
• Auto-stop Non-Prod Instances: Schedule stops for dev, QA, and training servers during off-hours. AWS Systems Manager Automation or AWS Instance Scheduler can start/stop instances on a cron schedule. By only running non-prod when needed, you save significantly on compute.
• Auto Scaling for SAP App Servers: SAP application servers can often scale horizontally. In AWS, you might use an Auto Scaling Group with a schedule or target utilization policy for app servers.
This way, you run minimal servers during light load and scale out for peak times.

Consolidation and Landscape Efficiency

An inefficient SAP landscape, one with too many duplicate systems or low-utilization servers, will rack up cloud costs regardless of instance pricing. The cloud gives us the flexibility to consolidate and optimize:

• Eliminate Unnecessary Systems: Audit your SAP instances. Are there old project systems or unused sandboxes running? It’s not uncommon to find forgotten test systems left on. Retire or shut down what isn’t truly needed.
• Consolidate Workloads: Where possible, consolidate multiple workloads on a single instance or platform. If you have separate SAP S/4HANA instances for different business units that are lightly used, consider consolidating them into one S/4HANA tenant or system. Fewer HANA databases means fewer high-memory instances to pay for. SAP HANA supports multitenant database containers, so multiple tenant databases can reside in one HANA system; this can be a way to run dev and QA on one HANA VM as separate tenants, rather than on two separate VMs.
• Shared Services: Some landscape components can be shared across environments. For instance, a single SAP Solution Manager or central SAProuter can serve the entire landscape rather than one per environment. Fewer supporting servers equals lower cost.
• Right-Size Every Environment: Even within a consolidated landscape, differentiate the sizing. We mentioned limiting dev/QA to a fraction of prod. Also consider whether every environment needs the same number of app servers: maybe prod has 4 app nodes for high throughput, but QA can do with 2 and dev with 1. This scaling down translates directly to cost savings in EC2 hours and licenses.

Keep in mind that consolidation should not compromise testing realism or performance SLAs for production. It’s a balance: consolidate and downsize where you safely can, and use cloud tooling to isolate or simulate full scale only when necessary.
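A sizing policy like the one mentioned above (dev at no more than 50% of production capacity, QA at no more than 70%) can be checked automatically. The following is a minimal sketch under stated assumptions: capacity is measured as instance memory in GiB, the policy fractions are the example figures from the text, and `check_sizing` plus the environment names are hypothetical. The memory figures used in the demo are those of r6i.8xlarge (256 GiB) and r6i.2xlarge (64 GiB).

```python
from typing import Dict, List

# Maximum non-prod capacity as a fraction of production (example policy from the text).
POLICY_LIMITS: Dict[str, float] = {"dev": 0.50, "qa": 0.70}


def check_sizing(prod_memory_gib: float,
                 env_memory_gib: Dict[str, float],
                 limits: Dict[str, float] = POLICY_LIMITS) -> List[str]:
    """Return one message per environment that exceeds its allowed share of prod."""
    violations: List[str] = []
    for env, memory in sorted(env_memory_gib.items()):
        limit = limits.get(env)
        if limit is None:
            continue  # no policy defined for this environment
        share = memory / prod_memory_gib
        if share > limit:
            violations.append(
                f"{env}: {share:.0%} of prod exceeds the {limit:.0%} limit"
            )
    return violations


if __name__ == "__main__":
    # prod on r6i.8xlarge (256 GiB); dev on r6i.2xlarge (64 GiB); QA sized like prod.
    for message in check_sizing(256, {"dev": 64, "qa": 256}):
        print(message)
```

Memory is only one possible capacity measure; a stricter check could apply the same limits to vCPU count or app server node count.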
Storage and Data Management Costs

For SAP workloads, storage costs are often as significant as compute. A single S/4HANA instance may have terabytes of data on EBS volumes. Now multiply that by multiple environments, plus backups, and storage can eclipse compute costs if not managed. AWS provides multiple storage options; using the right one for the right purpose is key:

• Use EBS Efficiently: Provision EBS volumes that meet performance needs without over-provisioning IOPS or size. AWS now recommends gp3 SSD volumes for SAP HANA over older gp2, as gp3 offers better price/performance. Only use expensive io2 volumes if you truly need ultra-high IOPS and durability for critical workloads; otherwise gp3 suffices in most cases. Always enable the delete on termination flag for temporary volumes and clean up unattached EBS volumes so you’re not paying for leftover storage.
• Offload Backups to S3: Don’t keep backup files on EBS or EFS longer than necessary. AWS offers the Backint agent for SAP HANA, which lets HANA back up directly to Amazon S3. This bypasses the need for large intermediate disk space and leverages cheaper object storage. S3 is significantly cheaper per GB than EBS for data at rest. Design a backup strategy for each environment and send those backups to an S3 bucket. From there, apply lifecycle policies to move older backups to colder storage classes like Glacier for further savings. For example, you might keep 7 days of recent backups in S3 Standard, then transition older ones to S3 Glacier or Deep Archive after 30 days.
# Example Terraform: S3 bucket for SAP HANA backups with lifecycle policy
# Note: the inline versioning and lifecycle_rule blocks below are deprecated in
# AWS provider v4 and later, where aws_s3_bucket_versioning and
# aws_s3_bucket_lifecycle_configuration are the separate-resource equivalents.
resource "aws_s3_bucket" "sap_hana_backup" {
  bucket        = "my-sap-hana-backups"
  force_destroy = true # allow auto-cleanup if destroying infra

  versioning {
    enabled = false # disable versioning for backup objects to save space
  }

  lifecycle_rule {
    id      = "MoveOldBackupsToGlacier"
    enabled = true

    transition {
      days          = 7
      storage_class = "GLACIER" # move backups to Glacier after 7 days
    }

    expiration {
      days = 180 # delete backups after 6 months
    }
  }

  tags = {
    Purpose = "SAP HANA Backups"
  }
}

Terraform snippet: The above S3 bucket is configured to automatically transition objects older than 7 days to Glacier and delete anything older than 180 days. This kind of policy ensures your S3 storage costs stay low by archiving cold data. In practice, set the timing according to your retention requirements. For critical backup buckets, also consider protections such as MFA Delete or S3 Object Lock; note that both require versioning to be enabled, so the space saved by disabling versioning has to be weighed against that extra safety.

• Use EFS for Shared Files, but Lifecycle Manage It: SAP applications often use shared file systems for transports (/usr/sap/trans), global SAP mounts (/sapmnt), and archives. Amazon EFS is ideal for this shared storage: it’s managed NFS and can be mounted by multiple EC2 instances. However, treat EFS space as premium (especially the default Standard storage class). Enable EFS Lifecycle Management (Intelligent-Tiering) so that files not accessed for 30 days move to the lower-cost Infrequent Access tier automatically. For example, old transport files or archived data can sit in EFS IA at a much lower cost per GB. Also, clean up EFS after major projects: deleting leftover project files or moving them to S3 frees up costly EFS space.
# Example Terraform: EFS file system with lifecycle policy for infrequent access
resource "aws_efs_file_system" "sap_shared_fs" {
  creation_token   = "sap-shared-fs"
  performance_mode = "generalPurpose"
  throughput_mode  = "bursting"

  lifecycle_policy {
    transition_to_ia = "AFTER_30_DAYS" # move files to Infrequent Access after 30 days
  }

  tags = {
    Name = "sap-shared"
  }
}

The above EFS definition will automatically tier off files not touched for 30 days. Mount this EFS on your SAP application EC2s to use for common directories. This way, you get the convenience of shared storage without continuously paying full price for cold data. Always review and delete any unused EFS file systems as well.

  • Archive and Purge Data: A broader data strategy can greatly reduce TCO. If your S/4HANA database is bloated with years of transactional data, consider using SAP data archiving to move old data to cheaper storage. Storing infrequently accessed data in S3 is far cheaper than keeping it in memory on HANA. Also, use Amazon S3 for storing large interface files or logs rather than keeping them on EBS/EFS, and enable lifecycle policies for those as well. Every GB you offload from expensive storage to S3/Glacier or delete entirely is money saved.

Network and Infrastructure Considerations

Often overlooked in cost planning are networking and auxiliary infrastructure costs:

  • Networking: Within a VPC, data transfer is free between instances in the same AZ, but charges apply across AZs or out to the internet. If your SAP landscape replicates data across AZs, you’ll pay for cross-AZ data transfer. This is usually worth the HA benefit, but be aware of it. NAT Gateway costs also catch people by surprise: if each environment VPC has its own NAT gateway and heavy internet egress, costs add up.
Mitigation: use VPC endpoints for S3 and other services so traffic stays internal and avoids NAT usage.
  • Backups and DR Infrastructure: If you maintain a warm standby environment or Disaster Recovery site, treat it as another environment in your cost planning. To save costs, you can keep DR systems mostly powered off, or use lower-performance instance types there, and only scale up if a failover is needed. AWS Backup can help here by storing snapshots that you can restore in a DR region on demand. Using lower-tier storage in the DR region for backups is a cost-effective strategy.
  • AWS Managed Services: Consider using services like AWS Backup to automate backup retention policies across your SAP instances. This can ensure snapshots or EBS backups follow a schedule and transition to cold storage after a set time, reducing manual oversight and accidental cost bloat. Also leverage tagging and AWS Cost Explorer to allocate and track costs by environment or system; this transparency can help identify which landscape components are most expensive and need optimization.

Environment Strategy and Automation

Your environment strategy should align with actual business usage patterns. Not every SAP environment needs to run 24/7 at full scale:

  • For development, testing, and training, use on-demand principles. If developers work 8am-6pm, there’s no reason to run dev systems all night. By shutting down servers during off hours, companies can save roughly 50-65% on those environments’ costs without any impact on users.
  • Use Infrastructure-as-Code to spin up temporary environments. Create a Terraform module for a full S/4HANA stack and instantiate it for a short-term project or testing, then destroy it when done. This ensures you pay only for the time actually needed. Automating system copies/refreshes from production backups can populate these ephemeral environments with realistic data when needed.
  • Plan fewer, well-utilized environments rather than many underutilized ones.
Each additional landscape brings overhead in compute, storage, and management. Wherever possible, combine roles.
  • Enforce governance around provisioning new SAP systems. Implement approval processes that consider cost impact. Some organizations formalize this with policies so that cloud spend doesn’t sprawl uncontrolled.

Conclusion

The bottom line: optimizing your SAP S/4HANA landscape design is often the biggest lever for reducing cloud TCO, even more than shaving off a few percent on instance prices. AWS provides a rich toolkit (various EC2 instance types, EBS/EFS storage classes, S3 tiers, and management services) that enables a high degree of cost control if used wisely for your SAP architecture. By right-sizing servers, turning off or consolidating what you don’t need, and leveraging services like S3, EFS lifecycle policies, and AWS Backup, you tackle the true cost drivers in an SAP environment. In practice, companies that take this holistic approach have seen significant savings in their AWS bills for SAP, all while maintaining performance and reliability. The cloud’s promise is agility and efficiency. With a practical engineering mindset and Infrastructure-as-Code automation, you can achieve an efficient SAP landscape that delivers on that promise, ensuring your cloud spend is as optimized as your SAP operations.
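To make the off-hours shutdown idea from the environment strategy above more concrete, here is a minimal Python sketch of the underlying arithmetic. It is an illustration under stated assumptions, not a pricing tool: the function name `compute_hours_saved` is hypothetical, the 8am-6pm window comes from the article's example, and the choice between a 5-day and a 7-day schedule is an assumption. It counts instance-hours only; storage such as EBS volumes keeps billing while instances are stopped, so the saving on a whole environment's bill will be smaller than the figures this prints.

```python
HOURS_PER_WEEK = 24 * 7


def compute_hours_saved(start_hour: int, end_hour: int, days_per_week: int) -> float:
    """Return the fraction of weekly instance-hours avoided by running
    only between start_hour and end_hour on days_per_week days."""
    if not 0 <= start_hour < end_hour <= 24:
        raise ValueError("expected 0 <= start_hour < end_hour <= 24")
    if not 0 <= days_per_week <= 7:
        raise ValueError("expected 0 <= days_per_week <= 7")
    running_hours = (end_hour - start_hour) * days_per_week
    return 1 - running_hours / HOURS_PER_WEEK


if __name__ == "__main__":
    # Assumed schedules: the article's 8am-6pm window, weekdays only vs. every day.
    for days in (5, 7):
        saved = compute_hours_saved(8, 18, days)
        print(f"{days} days/week: {saved:.1%} of instance-hours avoided")
```

Running the same calculation against your own team's working hours gives a quick upper bound on what a stop/start schedule can recover before you invest in automating it.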

By Deepika Paturu
Manual Investigation: The Hidden Bottleneck in Incident Response

Every engineering team I talk to has the same problem. When a P1 fires, coding stops. An engineer gets pulled in, spends 30 to 60 minutes hunting through logs, tracing requests across three or four systems, and cross-referencing deployment history before they can even form a hypothesis about what broke. By the time they have a diagnosis, they've already burned the better part of their morning.

We've normalized this. It's just become part of the job. But the math is brutal: A team handling 50 incidents per month at 4 to 8 hours of resolve time each is looking at 200 to 400 engineering hours lost. That's more than a full month of a senior engineer's capacity, every month, dedicated entirely to looking backward. The investigation workflow itself hasn't changed in 20 years.

Why Manual Investigation Breaks Down in Modern Systems

Traditional incident response was designed for simpler architectures. An on-call engineer would look at a dashboard, pull some logs, and apply tribal knowledge to find the cause. For known failure patterns with established runbooks, this still works.

Modern distributed systems are a different animal. A single error can originate in one service, propagate through a message queue, surface in a database connection pool, and present to the user as a generic 500 error. Tracing that sequence manually means jumping between your observability platform, your deployment tool, your APM, and whatever documentation exists for the relevant service.

Four problems make this worse:

  • Multi-system correlation. Errors don't stay in one place. Engineers have to manually trace a transaction across APIs, databases, and third-party dependencies to find where things actually broke.
  • Signal-to-noise ratio. A production system generates thousands of log entries per second under normal load and far more during an incident. Finding the meaningful signal by hand is slow and error-prone.
  • Context reconstruction.
Understanding the root cause requires knowing what changed recently, such as deployments, config updates, and infrastructure changes. That information is scattered across tools with incompatible formats and permission models.
  • Cognitive load under pressure. During a P0, engineers are simultaneously investigating, making decisions, and fielding status requests from stakeholders. Typically, no one person does all three of these well at once. Under that kind of load, things can easily get missed.

Manual correlation is where investigation time disappears. The workflow needs to change.

How AI Changes the Investigation Phase

Now, AI does the detective work before the engineer ever opens the ticket. The alert is just the starting gun.

1. Automated Timeline Reconstruction

AI correlates signals across your systems in real time. A reconstructed timeline might look like:

  • 13:42:15 – Deployment completed
  • 13:42:47 – First timeout errors appear
  • 13:43:12 – Error rate reaches 15%
  • 13:44:03 – Database connection pool exhausted

That sequence, assembled automatically, tells the engineer exactly where to look. No log-grepping required.

2. Similar Incident Matching

Most incidents aren't genuinely novel. They're variations on failure patterns the team has seen before, often caused by the same underlying conditions. The challenge is that the previous incident was three months ago, handled by a different engineer, documented inconsistently, and buried in a ticketing system nobody queries.

AI indexes past incidents and how they were resolved. When a new incident fires, it pulls up the closest matches instantly. "Error signature matches Issue #4532 from six weeks ago. Both followed Redis deployments. Resolution: connection pool adjustment."

That's the kind of context that currently lives in one engineer's head, if anyone's. And when that engineer leaves, it's gone.

3. Parallel Hypothesis Testing With Confidence Scoring

Human diagnosis is linear.
We check one hypothesis, rule it out, and move to the next. Under time pressure, this sequential approach extends MTTR every time the first guess is wrong.

AI evaluates multiple hypotheses simultaneously using a multi-agent validation architecture. Specialized agents analyze code changes, infrastructure metrics, and error patterns in parallel, then cross-check findings before surfacing anything to a human. The output is confidence-scored leads:

  • High (85%): Connection pool exhaustion. Deployment v2.4 increased concurrent requests without adjusting pool size.
  • Medium (60%): Database performance degradation.
  • Low (25%): Third-party authentication issue.

The engineer can focus immediately on the 85%.

4. Contextual Remediation Guidance

Finding the root cause doesn't settle what to do next. Engineers frequently have to pause after diagnosis to hunt for runbooks, check with the original developer, or make a judgment call with incomplete information about side effects.

AI covers that ground, recommending specific remediation steps based on system state and past resolutions: "Recommended action: Increase API connection pool to 100 in config/database.yml. Rolling restart required. Expect error rate to drop within 2 minutes."

The Architecture Behind It

Production-grade AI investigation runs on a composite architecture, not a single model, built to handle the volume, speed, and accuracy requirements of real incidents.

Traditional ML handles high-volume anomaly detection and noise reduction at the signal layer. Small language models handle fast, private log parsing where latency matters. LLMs take over for synthesis and generating summaries that engineers can actually act on. Multi-agent architectures add a "critic" layer where specialized agents cross-check findings before anything surfaces to a human, which is where false positive reduction actually happens.

This matters for teams evaluating whether to build internally.
Connecting an LLM to Slack and pointing it at a vector database of logs is straightforward. Building a system that handles novel incidents accurately, runs during a log storm, and never sends raw customer data to a public model endpoint is not. The retrieval pipeline alone (knowing which 50 log lines are relevant out of 5 million) is a substantial engineering problem. Honestly, that's what kills most homegrown attempts.

What This Means for SREs

Right now, SREs spend 40 to 60% of their time on manual data gathering, repeated context reconstruction, and re-investigating failure patterns the team has already solved. That's the portion AI handles.

At Strudel, we've seen teams cut investigation time from 30 to 60 minutes down to under 60 seconds on incidents where the system has relevant historical context. Engineers are still putting in the hours, just on different work: making decisions, checking the AI's conclusions, and building systems that prevent recurrence. At 50 incidents a month, that time adds up fast.
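The automated timeline reconstruction described earlier in this article comes down to merging events from separate tools into one chronological view. The Python sketch below illustrates only that merging step, under stated assumptions: the `Event` class, the `build_timeline` and `format_timeline` functions, the source labels, and the placeholder date are all hypothetical, and the four sample events are the ones from the article's example timeline. It does not represent how any particular product implements correlation.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass(frozen=True)
class Event:
    timestamp: datetime
    source: str   # hypothetical label for the tool that emitted the event
    message: str


def build_timeline(*streams: Iterable[Event]) -> List[Event]:
    """Merge events from any number of sources into one chronological list."""
    merged = [event for stream in streams for event in stream]
    # sorted() is stable, so events with equal timestamps keep their input order.
    return sorted(merged, key=lambda event: event.timestamp)


def format_timeline(events: Iterable[Event]) -> List[str]:
    """Render each event as 'HH:MM:SS [source] message'."""
    return [f"{e.timestamp:%H:%M:%S} [{e.source}] {e.message}" for e in events]


if __name__ == "__main__":
    day = (2026, 1, 1)  # placeholder date; only the times come from the article
    deploys = [Event(datetime(*day, 13, 42, 15), "deploy", "Deployment completed")]
    metrics = [Event(datetime(*day, 13, 43, 12), "metrics", "Error rate reaches 15%")]
    logs = [
        Event(datetime(*day, 13, 44, 3), "logs", "Database connection pool exhausted"),
        Event(datetime(*day, 13, 42, 47), "logs", "First timeout errors appear"),
    ]
    for line in format_timeline(build_timeline(logs, metrics, deploys)):
        print(line)
```

Sorting by timestamp is the easy part. In a real system the harder work is normalizing clocks and formats across tools and deciding which events are relevant enough to include.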

By Brian Kaufman
We Went Multi-Cloud and Almost Drowned: Lessons From Running Across AWS, GCP, and Azure

It started, as most bad architectural decisions do, with a PowerPoint slide from a VP who had just returned from a conference. “We need to avoid vendor lock-in,” he declared, and suddenly our platform engineering team had a mandate to distribute workloads across three public clouds.

Eighteen months later, we had something that technically ran on three major public clouds (AWS, GCP, and Azure). We also had a Terraform codebase that made people cry and an on-call rotation nobody wanted. This is what I learned about multi-cloud strategy: not the vendor pitch, but the messy reality of keeping production alive across cloud boundaries.

The Most Common Reason vs. the Reasons That Actually Matter

Vendor lock-in avoidance is the stated justification for most of the multi-cloud initiatives I have seen. It sounds sensible on paper: spread your bets, negotiate better pricing, avoid being held hostage. In practice, the switching-cost argument is largely theoretical. Nobody is ripping a mature workload off AWS and replanting it on GCP over a cost dispute. The expense of moving everything would have exceeded whatever we stood to gain.

The real reasons multi-cloud makes sense are less glamorous. You acquire a company running on a different cloud. A specific managed service, say BigQuery for analytics or Azure AD for identity, is genuinely better than what your primary cloud offers. Sometimes the decision is made for you: a customer's regulatory or compliance obligations may demand data residency in a region that only one specific provider serves or, in our case, a situation that required all three simultaneously. Those are valid reasons to proceed.

If you cannot articulate a concrete scenario where you would actually migrate between clouds, you are paying operational complexity, data egress costs, and management overhead as a tax for insurance you will never claim.

Where It Truly Became Painful

Initially, things seemed under control.
We chose Kubernetes, spun up clusters on EKS, GKE, and AKS, and told ourselves the hard part was over. It wasn't.

Networking was the first wall we hit. Each cloud provider handles virtual networks, traffic routes, and DNS completely differently. We spent three frustrating weeks chasing random connection failures between services split across GCP and AWS, eventually discovering our Transit Gateway was silently discarding packets in an obscure NAT scenario. Nobody warns you that there's no such thing as a common networking standard across clouds, and anyone claiming otherwise is really just pitching you their middleware.

Identity and access management became our second major headache. AWS, Google Cloud, and Azure each handle permissions through completely different systems. We attempted to maintain matching role definitions across all three to ensure consistency. That approach collapsed within two months as the configurations slowly diverged into chaos. We eventually consolidated around a single identity provider and built a translation layer between them. It was messy, but at least we could audit it properly.

Observability turned out to be our most stubborn ongoing problem. The moment a request travels across cloud boundaries, distributed tracing falls apart completely. Your metrics end up scattered across three separate platforms, each speaking its own query language, making unified monitoring feel nearly impossible. We consolidated onto Datadog, which helped, but the bill was eye-watering, and we still had blind spots at the edges.

Here's what day-to-day work actually felt like. Writing a single Terraform module to spin up a managed Postgres database meant wrestling with three completely different provider APIs simultaneously. That module was 800 lines long for what is theoretically a single resource type. Multiply that by every resource type you run, and you see the maintenance burden.
What Actually Worked

After twelve months of struggle, one decision turned everything around. We stopped chasing cloud-agnosticism and started being cloud-intentional instead.

That distinction is everything. Cloud-agnostic thinking pushes you toward lowest-common-denominator solutions, wrapping every service in abstraction layers, pretending all clouds are essentially identical. It's a trap that leaves you using none of them effectively. Cloud-intentional means deliberately matching specific workloads to the right provider, embracing each platform's native strengths, and only building abstractions at the actual boundaries where systems talk to each other.

Machine learning training lived on GCP because Vertex AI and TPU access justified it. Transactional workloads stayed on AWS, where our team had years of accumulated expertise. Azure handled enterprise identity because that's simply where our customers' Active Directory already lived.

The crucial mindset shift was simply accepting that each environment would naturally look different. We stopped forcing our GCP infrastructure to mirror our AWS setup. Instead, we focused on standardizing the connections between them (API contracts, event schemas, and a shared service mesh), and only where cross-cloud communication was genuinely necessary.

The Cost Nobody Mentions

Multi-cloud isn't purely a technology challenge. It's fundamentally a people problem. Finding engineers who truly master even one cloud platform is already difficult. Expecting a single team to maintain deep production-level expertise across three simultaneously is simply unrealistic. We naturally drifted toward informal specialization: two people owned GCP, three focused on AWS, and one managed Azure, and those knowledge silos made being on-call genuinely miserable.

Training costs are very real, too.
The mental overhead of constantly switching between three different consoles, three separate command-line tools, and three completely different mental models of networking and storage is a hidden tax that never appears in any architecture review document.

When You Shouldn't Attempt This

If you are a startup or a team of under fifty engineers, multi-cloud is almost certainly a mistake. Pick the cloud your team knows best, use its managed services aggressively, and ship. The theoretical risk of vendor lock-in is less dangerous than the very real risk of moving slowly because your infrastructure is too complex to operate.

Even at scale, if your workloads lack a concrete reason to span cloud boundaries, resist the pressure. A well-architected single-cloud deployment with proper DR will serve you better than a fragmented multi-cloud setup held together by duct tape and YAML.

Lessons Learned

  • Treat multi-cloud as a practical response to specific constraints, never as a default stance.
  • Go cloud-intentional rather than cloud-agnostic; use each provider’s strengths natively and standardize only at cloud integration boundaries.
  • Invest generously in monitoring and networking; that's where multi-cloud complexity bites hardest.
  • Be honest about your team’s capacity before committing to overhead that scales with the number of providers you support.

One Final Reflection

In hindsight, the most valuable thing about our multi-cloud journey was not the architecture; it was the discipline it forced on our service boundaries. When crossing a cloud boundary is expensive and painful, you think harder about what actually needs to cross it. That forced intentionality made our system design better overall, even in parts that never left a single cloud.

Could we have achieved the same architectural improvements simply by designing cleaner interfaces, without literally spanning multiple clouds? Probably yes. But nobody approves a budget for "Let's build better boundaries."
They do approve a budget for a "multi-cloud strategy." Sometimes solid engineering quietly sneaks in through a dubious PowerPoint presentation.

By Pruthvi Raj Seknametla

Top Monitoring and Observability Experts


Eric D. Schabell

Director Technical Marketing & Evangelism,
Chronosphere

Eric is Chronosphere's Director Community & Developer. He's renowned in the development community as a speaker, lecturer, author, baseball expert, maintainer and CNCF Ambassador. His current role allows him to help the world understand the challenges they are facing with observability. He brings a unique perspective to the stage with a professional life dedicated to sharing his deep expertise of open source technologies and organizations. More on https://www.schabell.org.

The Latest Monitoring and Observability Topics

Building a RAG-Powered Bug Triage Agent With AWS Bedrock and OpenSearch k-NN
Learn how a RAG-powered bug triage agent uses AWS Bedrock, OpenSearch, and dynamic scoring to automate crash analysis and routing.
June 9, 2026
by Rajasekhar sunkara
· 670 Views
Amazon Quick: AWS's Agentic Workspace, Explained for Engineers
A technical deep dive into Amazon Quick — how it works, how it connects to your tools via MCP, and where it sits in the AWS agent stack.
June 9, 2026
by Jubin Abhishek Soni DZone Core CORE
· 710 Views
Agentic AI Has an Observability Blind Spot Nobody Is Talking About
Production AI agents can trigger cascading failures when observability tracks what broke, but not whether the system can safely absorb remediation actions.
June 8, 2026
by Sayali Patil
· 1,011 Views · 2 Likes
How to Build an Agentic AI SRE Co-Pilot for Incident Response
Build an agentic SRE co-pilot using LLMs to autonomously reason, plan, and execute incident response across complex, multi-cloud infrastructure.
June 8, 2026
by Akshay Pratinav
· 979 Views
Observability for Agents and Workflows: Tracing Prompts, Tool Calls, and Business Outcomes End-to-End
Learn how to trace AI agents end to end, from prompts and tool calls to business outcomes, with observability practices for production workflows.
June 5, 2026
by Srinivas Chippagiri DZone Core CORE
· 2,195 Views · 1 Like
Build a GitHub Slack Bot With AWS Bedrock and MCP, Part 2
Build a Slack bot using AWS Bedrock and MCP to answer GitHub questions. Learn setup, architecture, and how to extend it with new tools and data sources.
June 4, 2026
by Sangharsh Agarwal
· 1,780 Views
Compliance Automated Standard Solution (COMPASS), Part 11: Compliance as Code, the OSCAL MCP Server Way
How AI-native tooling is finally closing the loop between compliance personas and OSCAL artifacts with an MCP-standardized, AI-agent-ready interface.
June 4, 2026
by Yuji Watanabe
· 1,913 Views
Build a GitHub Slack Bot With AWS Bedrock and MCP, Part 1
Building a Slack bot with traditional APIs led to 400 lines of code. Using MCP and AWS Bedrock reduced complexity, enabling scalable, tool-driven automation.
June 3, 2026
by Sangharsh Agarwal
· 2,063 Views · 1 Like
Stop Debugging Glue Jobs Manually: Building an Agentic Observability Layer for Data Pipelines
Glue failures scatter evidence across logs, metadata, and table state. A triage layer pulls it together and flags whether a rerun is safe.
June 2, 2026
by Vivek Venkatesan
· 1,786 Views · 1 Like
Implementing Observability in Distributed Systems Using OpenTelemetry
Instrument a Python Flask service with OpenTelemetry: auto-trace requests, export metrics to Prometheus, and inject trace IDs into logs for observability in one setup.
May 29, 2026
by Mugunth Chandran
· 2,676 Views · 1 Like
Building a Zero-Cost Approval Workflow With AWS Lambda Durable Functions
Learn how to build an ETL pipeline with human-in-the-loop approval that costs nothing while waiting — and see real cost data from processing 1,000 documents.
May 28, 2026
by Harpreet Siddhu
· 3,845 Views · 1 Like
Setting Up a Data Catalog With Azure Purview and Collibra: What Three Attempts Taught Me
Setting up a data catalog isn’t just a tool problem. My work with Azure Purview and Collibra showed success depends on governance, metadata, and adoption.
May 27, 2026
by Kuladeep Sandra
· 3,502 Views
When Perfect Data Breaks: The Journey from Data Quality to Data Observability
Data quality checks often miss silent failures. Use data observability to monitor data in motion and catch issues traditional tools miss.
May 25, 2026
by Divyakumar Savla
· 1,658 Views
One Query, Four GPUs: Tracing a Distributed Training Stall Across Nodes
One SQL query across 4 GPU nodes found a straggler in under a second using eBPF fleet fan-out, no central collector needed.
May 25, 2026
by Ingero Team
· 3,455 Views
AWS Managed Database Observability: Monitoring DynamoDB, ElastiCache, and Redshift Beyond CloudWatch
Three AWS managed databases, three dashboards, and one cascade you can only trace by hand. This guide fills the gap CloudWatch leaves open.
May 22, 2026
by Damaso Sanoja
· 3,648 Views · 1 Like
Architecting Petabyte-Scale Hyperspectral Pipelines on AWS
Learn how to overcome serverless bottlenecks to process and route petabyte-scale hyperspectral agricultural data on AWS.
May 21, 2026
by Anil Bodepudi
· 3,243 Views
Why SAP S/4HANA Landscape Design Impacts Cloud TCO More Than Compute Costs
SAP cloud TCO is driven more by landscape sprawl than by EC2 costs; optimize environments and use Terraform, S3, and EFS lifecycle policies to reduce costs.
May 20, 2026
by Deepika Paturu
· 2,384 Views
Manual Investigation: The Hidden Bottleneck in Incident Response
Learn why engineers are stuck investigating instead of fixing and how AI is changing the investigation process for modern systems.
May 18, 2026
by Brian Kaufman
· 1,243 Views
We Went Multi-Cloud and Almost Drowned: Lessons From Running Across AWS, GCP, and Azure
Multi-cloud sounds strategic, but usually happens by accident. Networking, IAM, and observability all break at boundaries. Only attempt it if you have no choice.
May 18, 2026
by Pruthvi Raj Seknametla
· 1,688 Views
Observability in Spring Boot 4
Bridge observability gaps in Spring Boot 4 by injecting Micrometer Trace IDs via SQL comments and propagating context through Kafka.
May 15, 2026
by ha dinh thai
· 2,411 Views · 1 Like