Yesterday I dealt with a very strange behaviour on a customer’s Skype for Business deployment that I want to share with you; maybe it can help someone. The environment:
One SfB Front-End Standard (updated to CU5) on Windows Server 2008 R2
One SfB EDGE (updated to CU5) on Windows Server 2008 R2
One Sonus SBC 1000 as PSTN Gateway
Enterprise Voice, Audio Conferencing, RGS and Call Park (for call pickup) enabled
At some point yesterday, the Mediation Service, the Call Park Service and the Response Group Service started to crash. They did not all fail at the same time but a few seconds apart. If we restarted a crashed service, it kept working for a few seconds and then crashed again!
Only these three services were affected. The others, for example the Front-End Service, kept working perfectly.
Event Viewer Errors
On the Front-End server we found two related errors in the Application log of Event Viewer for each service that crashed.
These are the errors for the Mediation Service:
Log Name: Application
Source: Application Error
Date: 1/3/2018 3:04:06 PM
Event ID: 1000
Task Category: (100)
Level: Error
Keywords: Classic
User: N/A
Computer: frontend.domain.lan
Description:
Faulting application name: MediationServerSvc.exe, version: 6.0.9319.272, time stamp: 0x57ff4069
Faulting module name: Microsoft.Rtc.Internal.Media.dll, version: 6.0.8953.265, time stamp: 0x58c2fe98
Exception code: 0xc0000005
Fault offset: 0x0000000000388362
Faulting process id: 0x35b0
Faulting application start time: 0x01d3849944f869f2
Faulting application path: C:\Program Files\Skype for Business Server 2015\Mediation Server\MediationServerSvc.exe
Faulting module path: C:\Windows\Microsoft.Net\assembly\GAC_64\Microsoft.Rtc.Internal.Media\v4.0_126.96.36.199__31bf3856ad364e35\Microsoft.Rtc.Internal.Media.dll
Report Id: f56561c7-f08e-11e7-b798-005056a143b5
Log Name: Application
Source: .NET Runtime
Date: 1/3/2018 3:04:04 PM
Event ID: 1026
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: frontend.domain.lan
Description:
Application: MediationServerSvc.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: exception code c0000005, exception address 000007FED9EB8362
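You get one pair of these events per crashed service, and they quickly pile up. A small script can pull out the fields that matter. Below is a minimal sketch. It assumes the events have been copied out of Event Viewer as text, like the examples in this post, in either the one-line layout or the "Field: value" per-line layout. The names parse_event, CrashInfo and is_access_violation are my own hypothetical ones and not part of any Skype for Business or Windows tool.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrashInfo:
    """Fields extracted from one exported Event Viewer entry (hypothetical helper)."""
    computer: Optional[str]
    event_id: Optional[int]
    application: Optional[str]
    module: Optional[str]
    exception_code: Optional[str]


def _first(pattern: str, text: str) -> Optional[str]:
    """Return the first capture group of pattern in text, or None if absent."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


def parse_event(text: str) -> CrashInfo:
    """Extract crash details from event text (one-line or multi-line layout)."""
    event_id = _first(r"Event ID:\s*(\d+)", text)
    # Event 1000 (Application Error) names the process as
    # "Faulting application name: X, version: ..."; event 1026
    # (.NET Runtime) uses "Application: X" instead.
    application = _first(r"Faulting application name:\s*([^,\s]+)", text) or _first(
        r"Application:\s*([^,\s]+)", text
    )
    module = _first(r"Faulting module name:\s*([^,\s]+)", text)
    # Matches "Exception code: 0xc0000005" (event 1000)
    # and "exception code c0000005" (event 1026).
    code = _first(r"[Ee]xception code:?\s*(?:0x)?([0-9a-fA-F]{8})", text)
    return CrashInfo(
        computer=_first(r"Computer:\s*(\S+)", text),
        event_id=int(event_id) if event_id else None,
        application=application,
        module=module,
        exception_code=f"0x{code.lower()}" if code else None,
    )


def is_access_violation(info: CrashInfo) -> bool:
    """0xc0000005 is the Windows STATUS_ACCESS_VIOLATION code."""
    return info.exception_code == "0xc0000005"
```

If you feed it the Mediation Service event above, it returns MediationServerSvc.exe as the application and Microsoft.Rtc.Internal.Media.dll as the module. It also returns 0xc0000005, the Windows access-violation code. Grouping the results by application and module makes it easy to see whether the crashing services share the same faulting module.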
The root cause
Long story short, I ran many (many!) different tests on the Front-End: updating to CU6, updating .NET, even installing a brand new Front-End. None of them helped. Then I turned my attention to the EDGE (I’m not sure why I didn’t do that earlier, but that’s the story), and I found this error event:
Log Name: Lync Server
Source: LS A/V Edge Server
Date: 1/3/2018 9:29:20 PM
Event ID: 22032
Task Category: (1028)
Level: Error
Keywords: Classic
User: N/A
Computer: edge.domain.lan
Description:
The system is low on non-paged memory. LS A/V Edge Server will start dropping packets.
Cause: The system needs more non-paged memory to handle the current work load.
Resolution: Increase the size of the non-paged memory.
On the EDGE the RAM was full. After a restart, everything started working fine again.
I suppose this issue is related to MRAS (Media Relay Authentication Service) and to media flow candidate discovery via the ICE protocol. What is strange is the effect on the services running on the Front-End!
I do not know why they crashed instead of simply timing out during the candidate search.
Remember that you can define IF an EDGE Server is associated with your Front-End servers for media flow candidates, and WHICH one.
The defined EDGE Server will be used in every call, inbound and outbound, PSTN-related or not.
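To see why the EDGE is involved in every call, recall how ICE works in general. Each endpoint gathers host candidates, server-reflexive candidates (discovered via STUN) and relayed candidates (allocated via TURN). In Skype for Business the A/V Edge plays the STUN/TURN role, and MRAS hands out the credentials needed to use it. In the classic, non-trickle form of ICE, all candidates are gathered before connectivity checks start. That means the Edge is contacted even when a direct path wins in the end. The sketch below uses the generic candidate priority formula and recommended type preferences from RFC 5245. This is an assumption for illustration: Skype for Business implements Microsoft’s own ICE variants, so its actual values may differ. The names TYPE_PREFERENCE, candidate_priority and order_candidates are hypothetical.

```python
from typing import List

# Recommended type preferences from RFC 5245 (generic ICE, not SfB-specific).
TYPE_PREFERENCE = {
    "host": 126,
    "peer-reflexive": 110,
    "server-reflexive": 100,  # discovered via STUN (the A/V Edge in SfB)
    "relayed": 0,             # allocated via TURN (the A/V Edge in SfB)
}


def candidate_priority(candidate_type: str, local_preference: int = 65535,
                       component_id: int = 1) -> int:
    """RFC 5245: 2^24 * type_pref + 2^8 * local_pref + (256 - component_id)."""
    if candidate_type not in TYPE_PREFERENCE:
        raise ValueError(f"unknown candidate type: {candidate_type}")
    if not 0 <= local_preference <= 65535:
        raise ValueError("local_preference must be in 0..65535")
    if not 1 <= component_id <= 256:
        raise ValueError("component_id must be in 1..256")
    return ((1 << 24) * TYPE_PREFERENCE[candidate_type]
            + (1 << 8) * local_preference
            + (256 - component_id))


def order_candidates(candidate_types: List[str]) -> List[str]:
    """Sort gathered candidate types from most to least preferred."""
    return sorted(candidate_types, key=candidate_priority, reverse=True)
```

With the default arguments, a host candidate gets priority 2130706431 and a relayed one gets 16777215. Relay candidates are therefore tried last, but they still have to be allocated on the Edge during call setup. This fits with an overloaded Edge affecting calls that would otherwise never traverse it.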
If you are interested in the media flow process and in the ICE, TURN and STUN protocols, I suggest watching this great video from Ignite:
Troubleshoot media flows in Skype for Business across online, server and hybrid
I hope this short article helps some of you.