OpenTelemetry GenAI spans in agent workflows: trace every tool call to find latency loops and token-cost hotspots

Agent failures rarely look like a single slow request. They look like loops: plan, call tool, retry, summarize, call another tool, exceed budget. OpenTelemetry’s GenAI semantic conventions give developers a portable way to trace those steps with model, token, workflow, and tool-call attributes.

The short version

  • Use one trace per agent task or user request.
  • Create spans for workflow invocation, model calls, retrieval, and tool execution.
  • Record token usage with gen_ai.usage.input_tokens and gen_ai.usage.output_tokens when available.
  • Use gen_ai.operation.name=execute_tool for tool spans and include tool name and call ID.
  • Treat prompt, arguments, and results attributes as opt-in because they may contain secrets or PII.

What OpenTelemetry gives you

OpenTelemetry is useful here because agent telemetry should flow through the same collectors, exporters, and backends as the rest of your system. The GenAI semantic conventions define attributes under gen_ai.* for model calls and agent workflows.

The OpenTelemetry docs currently mark parts of the GenAI agent and framework span conventions as in development, so expect changes. Still, the direction is clear: standard names for operations such as chat, embeddings, retrieval, execute_tool, invoke_agent, and invoke_workflow; attributes for model names, token counts, tool names, and optional messages.

Trace shape

A useful trace hierarchy:

agent.request  user asks: "fix failing test"
└─ gen_ai.invoke_workflow  repo-debugger
   ├─ gen_ai.chat  plan
   ├─ gen_ai.execute_tool  read_file
   ├─ gen_ai.execute_tool  run_tests
   ├─ gen_ai.chat  interpret failure
   ├─ gen_ai.execute_tool  edit_file
   ├─ gen_ai.execute_tool  run_tests
   └─ gen_ai.chat  final response

This immediately shows loops and hotspots. If run_tests dominates latency, optimize test selection. If repeated chat spans dominate cost, inspect planner behavior. If the agent calls the same retrieval tool twenty times, add caching or a better stopping rule.

Minimal Python instrumentation

Assumptions: Python 3.11+, the OpenTelemetry SDK installed and configured, and placeholder helpers (llm, tools, plan_and_execute) standing in for your own agent code.

from opentelemetry import trace

tracer = trace.get_tracer("agent.workflow")

async def run_agent(task: str):
    # One trace per agent task: a root request span with a workflow-invocation child.
    with tracer.start_as_current_span("agent.request") as root:
        root.set_attribute("app.workflow", "repo-debugger")
        root.set_attribute("enduser.id", "redacted-or-hashed")

        with tracer.start_as_current_span("gen_ai.invoke_workflow") as span:
            span.set_attribute("gen_ai.operation.name", "invoke_workflow")
            span.set_attribute("gen_ai.workflow.name", "repo-debugger")
            return await plan_and_execute(task)

async def call_model(model: str, messages: list[dict]):
    # One span per model call; record request/response model names and token usage.
    with tracer.start_as_current_span("gen_ai.chat") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.request.model", model)
        response = await llm.chat(model=model, messages=messages)
        span.set_attribute("gen_ai.response.model", response.model)
        span.set_attribute("gen_ai.usage.input_tokens", response.usage.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", response.usage.output_tokens)
        return response

async def execute_tool(name: str, call_id: str, arguments: dict):
    # One span per tool execution, named after the tool so backends can group by it.
    with tracer.start_as_current_span(f"gen_ai.execute_tool {name}") as span:
        span.set_attribute("gen_ai.operation.name", "execute_tool")
        span.set_attribute("gen_ai.tool.name", name)
        span.set_attribute("gen_ai.tool.call.id", call_id)
        # Arguments are opt-in. Prefer redacted summaries.
        span.set_attribute("tool.args.keys", sorted(arguments.keys()))
        return await tools.call(name, arguments)

Do not record full prompts, tool arguments, or tool results by default. OpenTelemetry’s conventions include optional fields for messages, tool definitions, arguments, and results, but those can contain credentials, customer data, or source code.
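
The snippets above assume a tracer provider has already been configured. A minimal sketch of that setup, assuming the opentelemetry-sdk and opentelemetry-exporter-otlp packages and an OTLP-capable collector endpoint (the service name here is illustrative):

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Service name is illustrative; point OTEL_EXPORTER_OTLP_ENDPOINT at your collector.
provider = TracerProvider(resource=Resource.create({"service.name": "agent-service"}))
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)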

Cost and latency queries

Once spans are emitted, build dashboards around:

  • Total task latency by workflow and outcome.
  • Model latency by provider and model.
  • Input/output tokens per workflow.
  • Estimated cost per trace.
  • Tool latency by tool name.
  • Retry count and loop count.
  • Error rate by operation name.

Providers usually do not emit cost directly. Compute it in your telemetry pipeline or backend from the model name and token counts, using a pricing table you maintain. Keep that table versioned; provider pricing changes.
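
A minimal sketch of that computation, with made-up model names and prices (the real numbers belong in a versioned config, not in code):

# Illustrative pricing table in USD per million tokens. Model names and values are placeholders.
PRICING = {
    "example-model-small": {"input": 0.50, "output": 1.50},
    "example-model-large": {"input": 3.00, "output": 15.00},
}

def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float | None:
    """Estimate span cost from gen_ai.request.model and gen_ai.usage.* attributes."""
    prices = PRICING.get(model)
    if prices is None:
        return None  # unknown model: surface it rather than silently guessing
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000

Summing the per-span estimates by trace ID gives estimated cost per task.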

Detect loops

A loop is often visible as repeated spans with the same operation and similar attributes. Add counters:

  • Number of model calls per trace.
  • Number of tool calls per trace.
  • Number of repeated calls to the same tool with the same argument fingerprint (see the sketch below).
  • Maximum trace duration.
  • Budget-killed tasks.

When a limit trips, annotate the trace:

span.set_attribute("agent.stop_reason", "tool_loop_budget_exceeded")
span.set_attribute("agent.loop.iterations", 12)

This makes failures searchable instead of anecdotal.
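
A minimal sketch of the argument-fingerprint counter mentioned above; LoopGuard and MAX_REPEATS are hypothetical names, not part of the conventions:

import hashlib
import json
from collections import Counter

MAX_REPEATS = 3  # illustrative per-trace budget for identical tool calls

class LoopGuard:
    """Counts tool calls per trace, keyed by tool name plus an argument fingerprint."""

    def __init__(self):
        self.counts = Counter()

    def fingerprint(self, name: str, arguments: dict) -> str:
        payload = json.dumps(arguments, sort_keys=True, default=str)
        return f"{name}:{hashlib.sha256(payload.encode()).hexdigest()[:16]}"

    def allow(self, span, name: str, arguments: dict) -> bool:
        key = self.fingerprint(name, arguments)
        self.counts[key] += 1
        if self.counts[key] > MAX_REPEATS:
            span.set_attribute("agent.stop_reason", "tool_loop_budget_exceeded")
            span.set_attribute("agent.loop.iterations", self.counts[key])
            return False  # caller should stop or replan instead of calling again
        return True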

Gotchas

GenAI semantic conventions are evolving. Pin your instrumentation package versions and document which convention version your dashboards expect.

High-cardinality attributes can hurt observability backends. Do not put raw prompts, full file paths for every temp file, or arbitrary URLs into indexed attributes. Use events or redacted blobs with sampling if you need deep debugging.

Security matters. Tool arguments can contain API keys; tool results can contain PII. Default to summaries, hashes, and opt-in capture for short-lived debugging sessions.
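
One way to keep capture opt-in, sketched against the tool span from earlier; the AGENT_CAPTURE_TOOL_ARGS flag is a hypothetical name:

import json
import os

# Hypothetical flag: only set it for short-lived debugging sessions.
CAPTURE_TOOL_ARGS = os.environ.get("AGENT_CAPTURE_TOOL_ARGS") == "1"

def record_tool_arguments(span, arguments: dict):
    if CAPTURE_TOOL_ARGS:
        # Raw arguments go on a span event, truncated, rather than an indexed attribute.
        span.add_event("tool.arguments", {"json": json.dumps(arguments, default=str)[:4000]})
    else:
        # Default: only the argument keys, never the values.
        span.set_attribute("tool.args.keys", sorted(arguments.keys()))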

For the live control side of agent work, we have been building Grass. Traces explain what happened and where latency or cost went; Grass lets you start, watch, and control coding-agent sessions from iPhone/iPad while they run on a managed VM or your own machine. You can approve tool calls, review diffs, and resume long-running work away from your desk.

Try it at https://codeongrass.com.

Conclusion

Instrumenting agent workflows with OpenTelemetry GenAI spans turns opaque agent behavior into traces you can debug. Start with task, model, retrieval, and tool spans. Add token counts and latency. Keep sensitive payloads out by default. The payoff is fast: you will see cost hotspots, tool loops, slow calls, and failing steps in the same observability stack you already use.

Sources

  • OpenTelemetry GenAI span and metric semantic convention docs
  • OpenTelemetry GenAI registry attributes
  • Microsoft Azure agent observability examples using OpenTelemetry