
Modern contact centers are sprawling systems. Even when the product is consolidated as a CCaaS, you might see Azure PSTN calling, direct routing via an SBCaaS, and carriers forwarding SIP to Azure-hosted numbers for ACS (yes, I’ve worked with customers using all three infrastructures simultaneously). On top of that, there’s live chat, Exchange-synced email, Copilot Studio agent-run interactions and SMS via Twilio. And then there’s the client itself: the Copilot Service Workspace app where agents actually interact.
With so many moving pieces, users report a familiar mix of symptoms: slow form loads (e.g., the active conversation form), delays setting up two-way audio when a call is accepted or transferred, and inconsistent audio quality/volume as a caller moves from IVR to queue music to a live agent. The question is not if issues will happen, but how you’ll detect, isolate and fix them quickly.
This post shares the structured approach my colleague and I used for a focused performance analysis at one of our customers. Grounded in real telemetry, the intention was to build the foundation and method for a durable, long-term monitoring strategy.
What users told us and what the data actually showed
Across teams, users described the same experience in slightly different words: “the app freezes when I open a case,” “tabs are sluggish,” “calls take too long to connect,” and “chat arrives late or stalls.”
In day-to-day terms, that meant 15–25 seconds to open a case with the screen unresponsive, 10–20 seconds every time they switched tabs, and long waits (sometimes up to 15–20 seconds) for an incoming call to truly connect. Chat sessions could appear, then hang before the first message became visible. These weren’t isolated anecdotes; multiple groups reported similar patterns.
When we mapped everyday workflows, users also called out slow “bread-and-butter” actions: assigning cases, send & close from the email form, bulk management from a list view and waiting for the timeline to hydrate before they could expand items. These are small moments that add up when you handle dozens of records per shift.
So – was the system really as slow as it felt? In short: yes, mostly. Our measurement window (two weeks) validated the core complaints:
1. Case management times skewed slow. The agreed-upon threshold was 2.5 seconds per single action. A bulk assignment of multiple case records or a merge of duplicate records could peak at 15–25 seconds, which in turn drove frustration.
2. Tab switches and conversation availability were better than case opens but still lagged user expectations; users felt the drag, and the data backed that up.
3. Incoming calls showed real delays from call acceptance to established two-way audio, consistent with field reports.
Two potential root causes were found almost immediately behind the scenes. The first was a “tail” problem on the network. While median throughput looked acceptable, the 95th-percentile latency spiked to ~1,974 ms during the study window. That level is fine for email but painful for real-time voice/chat, and it explains why experiences felt inconsistent: most sessions were okay, but the bad ones were really bad.
The other was over-strict assignment rules. We observed repeated assignment attempts (a dozen in one trace) failing to find a matching agent because language/skill criteria were too tight. To a user, that looks like “the call/chat is stuck,” even though the system is busy trying, and failing, to connect it.
A few reports were mixed or environment-dependent. For example, some colleagues felt forms were slower off-site than on-site, and Teams components occasionally took longer to load, which are signals that device, browser, or VPN conditions also play a role. But taken together, the telemetry corroborated the main story: case forms and key actions were slower than they should be, conversation setup was uneven, and a small set of routing and network tail issues amplified the pain on bad days.
A quick note on expectations
One thing worth noting in relation to the analysis is how subjective user expectations can be. Many performance expectations come from prior experience with lightweight ticketing tools. A native CRM like D365 brings more context at runtime: security and entitlement checks, related tables, timeline hydration, knowledge retrieval, analytics hooks and sometimes plug-ins, all of which can add a little time to initial loads or tab switches. In return, agents get a richer customer picture and fewer swivel-chair hops, which typically improves quality and first-contact resolution.
The goal isn’t “sub-second everything,” but consistently responsive experiences with clear SLOs that reflect the value gained from deeper context (e.g., smooth case opens and predictable navigation, even if not as snappy as a minimal ticketing UI). Setting expectations in line with what’s gained helps organizations judge the system fairly and focus on the worst-tail experiences that truly hurt users.
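To make that concrete, here’s a minimal sketch of what scenario-level SLO targets could look like when expressed as data rather than gut feeling. The scenario names and most of the numbers are illustrative assumptions, not the customer’s actual targets; only the 2.5-second per-action threshold comes from this study.

```python
# Illustrative SLO targets per scenario (milliseconds). Names and numbers are
# assumptions for the sketch; only the 2.5 s per-action threshold comes from
# the study described in this post.
SLO_TARGETS_MS = {
    "case_form_open":     {"p50": 2500, "p95": 5000},
    "tab_switch":         {"p50": 1500, "p95": 4000},
    "call_connect":       {"p50": 3000, "p95": 8000},
    "timeline_hydration": {"p50": 2500, "p95": 6000},
}

def slo_breaches(scenario: str, p50_ms: float, p95_ms: float) -> list[str]:
    """Return the targets breached by one scenario's measured percentiles."""
    target = SLO_TARGETS_MS[scenario]
    breaches = []
    if p50_ms > target["p50"]:
        breaches.append(f"{scenario}: p50 {p50_ms:.0f} ms > {target['p50']} ms")
    if p95_ms > target["p95"]:
        breaches.append(f"{scenario}: p95 {p95_ms:.0f} ms > {target['p95']} ms")
    return breaches

# Example: healthy median, badly breached tail.
print(slo_breaches("case_form_open", p50_ms=2100, p95_ms=14300))
```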
Our approach
Based on the reported problem scenarios, we ran a focused, two-week study with selected teams, replaying real agent workflows while capturing telemetry in parallel. Sessions were observed live (two agents per unit), and we limited scope to four areas: network performance, conversation assignment, app/form load performance and “everyday” business scenarios (assign, send & close, merge, timeline).
What we measured & the tools we used
We combined product telemetry with targeted diagnostics using the following toolset:
– Azure Application Insights
– Azure Data Explorer
– Power Apps Monitor
– Azure Communication Services Call Diagnostics/Logs
– Dataverse Routing Diagnostics (deprecated but still usable at the time of publishing)
– Telemetry Insights in the D365 Implementation Portal
Key signals included network throughput (P50) and latency (P95), assignment attempts and presence, form and conversation load times, User Facing Diagnostics (UFD) events, and media quality (e.g. RTT, jitter, packet loss).
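As an example of how signals like these can be pulled together, here’s a minimal sketch that queries percentile load times with the azure-monitor-query Python SDK. The workspace ID is a placeholder, and the table and column names (pageViews, duration) follow the classic Application Insights schema; adjust them to wherever your client telemetry actually lands, since workspace-based resources and Telemetry Insights exports use different names.

```python
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

WORKSPACE_ID = "<log-analytics-workspace-id>"  # placeholder

# p50/p95/p99 page-load duration per page over the two-week study window.
QUERY = """
pageViews
| where timestamp > ago(14d)
| summarize p50_ms = percentile(duration, 50),
            p95_ms = percentile(duration, 95),
            p99_ms = percentile(duration, 99)
    by name
| order by p95_ms desc
"""

client = LogsQueryClient(DefaultAzureCredential())
response = client.query_workspace(WORKSPACE_ID, QUERY, timespan=timedelta(days=14))

for table in response.tables:
    for row in table.rows:
        print(dict(zip(table.columns, row)))
```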
What the data showed
Network tail risk
Median throughput was acceptable, but latency tails were high: P95 latency peaked at 1,974 ms with a median around 351 ms, above what’s comfortable for real-time voice/chat—even if “typical” sessions looked fine.
Two network metrics are especially useful across all channels:
Throughput (NW_Throughput_P50_Mbps) = your median data rate. In one study, a median of ~9.5 Mbps looked acceptable against a common 4 Mbps minimum, but it masked a tail of users below target. Median alone isn’t enough.
Latency (NW_Latency_P95_ms) = your 95th percentile round-trip time. We observed a p95 as high as 1,974 ms (with a median around 351 ms) over the review window. Clearly problematic for real-time voice, where <100 ms is ideal and <200 ms may be tolerable.
These metrics are native to the telemetry ruleset that comes with the D365 Implementation Portal’s Telemetry Insights report. If you’re not already using it, I highly recommend looking into it. I also posted Where’s the Bottleneck? Telemetry Insights Might Know on CONTACT CENTER CHRONICLES back in February, detailing the contents and usage of the service if you want to know more.
Reflection: the system wasn’t uniformly “slow,” but the tails were unacceptably high for real-time workloads. That’s why users reported inconsistent experiences; median users were fine whereas tail users had a bad day. The takeaway for us was that we need to track p50 and p95 (and often p99) side by side. Voice and chat suffer in the tail even when the median looks fine. The same thinking applies to app telemetry: pair average form load with p95 to see the “bad day” experience, not just the typical one.
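A small illustration of why the pairing matters, with synthetic numbers shaped like our study window (median around 350 ms, a slow tail near 2 s):

```python
import numpy as np

def latency_summary(samples_ms) -> dict:
    """Report the average and median together with the tail percentiles."""
    arr = np.asarray(samples_ms, dtype=float)
    return {
        "mean_ms": round(float(arr.mean()), 1),
        "p50_ms":  round(float(np.percentile(arr, 50)), 1),
        "p95_ms":  round(float(np.percentile(arr, 95)), 1),
        "p99_ms":  round(float(np.percentile(arr, 99)), 1),
    }

# Synthetic sample: ~92% healthy sessions around 350 ms, ~8% with ~1.9 s round trips.
rng = np.random.default_rng(7)
samples = np.concatenate([
    rng.normal(350, 60, 920),
    rng.normal(1900, 250, 80),
])
print(latency_summary(samples))
# The median stays near 350 ms while p95/p99 expose the ~2 s tail --
# the "bad day" experience the average alone would hide.
```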
Assignment friction explained
The call orchestration traces showed repeated assignment attempts that couldn’t find an eligible agent, often due to strict skill/language criteria and shifting presence states. In one case study, a single conversation went through 12 assignment attempts before the system gave up, which is exactly the kind of “nothing is happening” moment agents perceive as a stuck system or an underperforming routing model.
Reflection: The platform behaved as configured; the operational design (skills, language, presence volatility) created tight funnels that were impossible to clear when no eligible agent was available. A deep dive into the agent status history showed that agents’ statuses frequently flipped to Offline with no reasonable explanation. Mapping the conversation assignment timestamps against the status log quickly revealed a correlation between the failed assignments and the Offline activity.
Some of the retry loops we observed could potentially stem from presence flips to “Offline” caused by transient network outages or server-access glitches. When the system can’t reliably confirm that an agent is logged in, it won’t route to that agent and will reattempt assignment elsewhere, making it look like the queue is “stuck” even though the engine is retrying. In our report to the customer we recommended monitoring unexpected Offline transitions and correlating them with network/IDP/service health during incident windows.
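A minimal sketch of that correlation step, assuming the failed assignment attempts and the agent status history have been exported somewhere queryable; the event shapes and timestamps below are made up for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical events: failed assignment attempts from routing diagnostics,
# and Offline transitions from the agent status history.
failed_assignments = [
    ("agent-42", datetime(2025, 3, 4, 9, 15, 12)),
    ("agent-42", datetime(2025, 3, 4, 9, 15, 40)),
]
offline_transitions = [
    ("agent-42", datetime(2025, 3, 4, 9, 14, 58)),  # unexpected Offline flip
]

def correlate(failed, offline, window=timedelta(minutes=2)):
    """Pair each failed assignment with Offline flips for the same agent
    inside a +/- window, to spot presence-driven retry loops."""
    hits = []
    for agent, t_fail in failed:
        for a, t_off in offline:
            if a == agent and abs(t_fail - t_off) <= window:
                hits.append((agent, t_fail, t_off))
    return hits

for agent, t_fail, t_off in correlate(failed_assignments, offline_transitions):
    print(f"{agent}: assignment failed at {t_fail:%H:%M:%S}, "
          f"Offline at {t_off:%H:%M:%S}")
```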
Forms vs. conversations
In the scenarios we reproduced, no abnormal load times were observed for the case and conversation forms; most timings landed in a reasonable band once we accounted for content and timeline volume.
For example, merging two empty cases typically took 3.6–5.1 s, but ~14 s when each case included email content. This was evaluated to be expected and within reason, given the re-render and activity merge. Timeline hydration after open averaged ~2.3 s, scaling with activity count. Where individuals still felt “it’s always slow,” our report recommended individual diagnostics (device, browser, VPN/site, local network) to further pinpoint individually perceived issues.
In our follow-up report we noted a potential Exchange linkage avenue to check for perceived delays around “send & close” patterns. Not a confirmed fault, but a sensible place to inspect if users report email-related pauses.
Reflection: The recreated flows were generally healthy and our conclusion was that persistent “slow” reports are likely environment-specific (endpoint/network) or content-driven (very heavy timelines), rather than a systemic app regression.
Voice quality signals
UFD events were rare (26 events across 1,474 calls, roughly 1.8% of calls). Media metrics largely stayed within healthy thresholds (RTT <500 ms, jitter <30 ms, packet loss <10%), with some isolated jitter peaks worth watching.
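For ongoing monitoring, those thresholds are easy to codify. The sketch below flags calls that exceed them; the field names are assumptions and need to be mapped to your actual ACS call diagnostics export.

```python
# Healthy-media thresholds referenced above; field names are illustrative.
THRESHOLDS = {"rtt_ms": 500, "jitter_ms": 30, "packet_loss_pct": 10}

def flag_media_issues(call: dict) -> list[str]:
    """Return the media metrics on one call that exceed the healthy thresholds."""
    return [
        f"{metric}={call[metric]} (limit {limit})"
        for metric, limit in THRESHOLDS.items()
        if call.get(metric, 0) > limit
    ]

calls = [
    {"call_id": "c-001", "rtt_ms": 120, "jitter_ms": 12, "packet_loss_pct": 0.4},
    {"call_id": "c-002", "rtt_ms": 310, "jitter_ms": 41, "packet_loss_pct": 1.2},
]
for call in calls:
    issues = flag_media_issues(call)
    if issues:
        print(call["call_id"], "->", ", ".join(issues))
```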
We also timed a set of scenarios based on user reports, capturing the following load times:
Assign case: ~7 s manually; ~2.3 s for auto-assign; mass assign ~1.03 s per case.
Send email from timeline: perceived delay tied to database write + Exchange sync; no abnormal app regression observed.
Merge two cases: 3.6–5.1 s (empty) and ~14 s (with emails), as expected given content.
Timeline load after open: ~2.3 s on average; scales with activity volume.
Overall, there were no abnormal spikes in these reproduced scenarios; outliers likely reflect local conditions such as device type, browser, VPN or site.
Our recommendations
1) Treat the tail. Track and act on p95/p99, especially site-segmented latency, rather than just medians. Use Telemetry Insights/App Insights to identify high-latency users, then validate with traceroute/client diagnostics and A/B tests (office vs. remote).
2) Tighten assignment design. Observe live in the voice channel to confirm whether declines are manual vs. implicit timeouts; review presence changes; right-size skills/language filters; verify capacity using Omnichannel capacity profiles (not legacy device-based fields).
3) Tune forms before code. Compare incident vs. conversation form performance by department; reduce on-load JS, defer heavy subgrids/attachments to tabs, optimize lookups/Quick Views, and review plug-ins/legacy workflows for synchronous hotspots.
4) Measure the journey end-to-end. Keep scenario dashboards for assign, send & close, merge, and timeline hydration; tag changes (deployments/routing/IVR) to correlate regressions (see the sketch after this list); separate conversation vs. incident loads in reporting. Use Fiddler/TTFB tests when VPN is involved.
5) Blend tech data with human feedback. Run a four-week follow-up where agent feedback (and a short caller survey) is captured alongside telemetry; make feedback effortless with automatic timestamps/IDs in reports.
6) Mind the platform factors. Review batch schedules, datacenter placement constraints, and ACS global distribution effects; where feasible, reschedule heavy jobs and revisit form/business-rule scope to limit load on the critical path.
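For recommendation 4, the change tagging itself can be as simple as a dated log that gets overlaid on the scenario dashboards. A minimal sketch, with a hypothetical change log and regression window:

```python
from datetime import datetime, timedelta

# Hypothetical change log: in practice this could live in a small Dataverse
# table, a SharePoint list, or release annotations in Application Insights.
changes = [
    {"when": datetime(2025, 3, 3, 18, 0), "what": "routing ruleset v12"},
    {"when": datetime(2025, 3, 5, 7, 30), "what": "case form JS update"},
]

def changes_before(regression_start: datetime, lookback=timedelta(hours=48)):
    """List tagged changes that landed shortly before a regression window,
    so 'mystery regressions' can be matched to a concrete deployment."""
    return [
        c for c in changes
        if regression_start - lookback <= c["when"] <= regression_start
    ]

for c in changes_before(datetime(2025, 3, 5, 9, 0)):
    print(f"{c['when']:%Y-%m-%d %H:%M} - {c['what']}")
```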
Bottom line: users were right about slowness in key moments, especially case loads and uneven conversation setup, but the biggest wins will come from reducing tail latency, easing over-strict assignment rules and trimming form load paths, with continuous measurement to lock in gains.
The observability model used
What we tried to keep in mind when conducting this analysis was to think of the contact center as four observable layers. When something’s slow or crackly, decide which layer to test first and then move outward.
1. Agent app & workflows (D365 Copilot Service Workspace): page and form loads, timeline hydration, tab switches, assignment/orchestration steps, plug-ins, and client extensions.
2. Conversation orchestration: call/chat allocation, queue transfer, IVR/Copilot Studio prompts, bot handoffs.
3. Media & network: client device/OS/browser, Wi-Fi vs. wired, VPN, LAN/WAN, SBCs, carrier peering, codec policy, packet loss/jitter/latency.
4. Upstream/downstream services: Exchange sync, ACS/Telephony, Twilio, knowledge/citations, data APIs (read-only order context, etc.).
Your monitoring should tag events and timings by these layers so you can localize a slowdown (e.g., “tab change in case form >10s” vs. “call connect time >8s” vs. “p95 network latency spike”). In our recent assessment, we saw long case-form loads and inconsistent conversation timings confirmed by both user reports and telemetry; clear signals to inspect layers 1 and 2 first. This also set the scope for our analysis.
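If you build your own measurements (for example a replay harness or a client extension), tagging by layer can be as simple as an attribute on every span. A minimal sketch, assuming OpenTelemetry with the azure-monitor-opentelemetry distro exporting to Application Insights; the attribute names (layer, site) are our own convention, not a product schema.

```python
import time
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace

configure_azure_monitor()  # uses APPLICATIONINSIGHTS_CONNECTION_STRING
tracer = trace.get_tracer("contact-center-observability")

# Layer values mirror the four-layer model above.
with tracer.start_as_current_span("case_form_open") as span:
    span.set_attribute("layer", "agent_app")      # layer 1
    span.set_attribute("site", "office-main")     # illustrative segmentation
    time.sleep(0.1)                               # placeholder for the measured action

with tracer.start_as_current_span("call_connect") as span:
    span.set_attribute("layer", "orchestration")  # layer 2
    time.sleep(0.1)                               # placeholder for the measured action
```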
Quick wins we shipped
Timeline & form tuning: paginate timeline; trim heavy controls on case forms; move non-critical lookups off the critical path. Agents reported faster case work immediately.
Queue/assignment hygiene: relax overly strict skill rules (e.g., language) or provide a fallback profile to avoid repeated assignment failures.
Network tail hunt: isolate office vs. remote; flag sites with p95 latency >200 ms; work with ISP/network team on QoS for real-time media; validate ports and packet prioritization; consider carrier peering improvements.
Change tagging: annotate app releases and routing changes to stop “mystery regressions” and speed up root cause analysis.
Agent guidance: publish a one-pager: supported browsers, tab discipline, restart cadence, wired preference, and how to report repro steps with timestamps.