We had tested 3CX in multiple virtual server environments (2008 Hyper-V as Host OS and 2003 Server R2 as Guest OS with 3CX installed), and never had any problems, until today when we implemented 3CX into our own company environment to replace our previous IP PBX (a project on which I was the lead). It's a long explanation but well worth reading so you don't get caught out by the same thing.
Our environment mirrored our test environment with the host server running 2008 Hyper-V and the guest OS for our 3CX server was Server 2003 R2 Std Edition. All was fine when I migrated last night but I did notice a sharp spike in CPU (100%) when McAfee updated itself at 5pm. I thought nothing much of it until I checked again this morning and noticed McAfee was still using a lot of CPU (this is with the program files and all users 3CX folders excluded from scanning). When I mentioned this to a colleague he said he would just allocate another CPU to the virtual server, which we did.
Everything appeared to be fine, but after an hour or so call quality started to degrade. This only affected voice from the 3CX server to the phones: people we called could hear us fine, but their voice to us was poor. LAN to LAN calls, calls to other sites across our VPN and calls via our ISDN gateway were fine. The issue only appeared to affect calls via our VoIP provider that were in effect using the 3CX server as a gateway.
We looked at bandwidth and various other possible factors (while a colleague built a dedicated physical server to restore 3CX to if necessary), but by then voice quality had gotten so bad that we rebooted the server. After the reboot everything was fine again. However, sure enough, over time the issue returned and calls started getting worse again.
We had never had any problems with our previous VoIP system, and the LAN and phones had not changed; only the back end PBX platform had altered. We then noticed some VERY odd behaviour on the 3CX server. Constant pings to the server from a PC on the LAN returned rock solid sub-1ms responses, as you would expect. Pings from the server to the same PC on the LAN would show:
<1ms
<1ms
<1ms
625ms
<1ms
625ms
<1ms
<1ms
625ms
<1ms
Pings from the server to its own IP showed the same fluctuation. We could see that the behaviour of ping traffic from the server corresponded to our symptoms of poor voice being sent from the server, but we couldn't figure out why. None of our other virtual servers running the same OS on the same host exhibited the same behaviour, and none of our testing had revealed this issue. At this stage it was 4pm and our telephone based support and sales team were none too happy with the migration, as you can well imagine! It was then that a colleague of mine came across this article regarding Server 2003 guest virtual servers using multiple CPUs:
http://techblog.mirabito.net.au/?tag=usepmtimer
The symptoms were that my Windows Server 2003 machine would return very strange results when pinging hosts, both internally and externally, such as returning all four responses within about half a second, yet measuring them at over 3000ms (which means they should have timed out, rather than given me a reading in milliseconds) as well as occasionally providing negative values for response times.
Obviously the results were completely inaccurate, but I couldn’t work out why the issue was only happening on a handful (not all) Hyper-V guests running Windows Server 2003 and none on Server 2008.
Turns out that this is an issue if all of the following are true:
You are running an operating system prior to Windows Vista or Windows Server 2008
You are running the current implementation of Microsoft Hyper-V (i.e. at the time of writing)
You have presented multiple processors to the Hyper-V guest
The issue occurs because the multiprocessor HAL in Hyper-V causes the guest’s operating system Time Stamp Counter (TSC) to skew. According to this blog the problem wouldn’t ordinarily occur if you were running Windows Server 2003 with SP2 unless the BIOS check fails to determine if the TSC should be used. More specifically, if I understand correctly the issue occurs because the processors (or cores, if we’re talking about a single multicore processor) are not in sync with each other, which produces sporadic out-of-time results where time sensitive operations (such as ping responses) are in use.
Jackpot!
This was the only virtual server we had with multiple CPUs, which is why we had not seen it before. To prove the theory we took a non critical virtual server, assigned it multiple CPUs and set up a constant ping to itself. Sure enough, after 4 or so minutes ping responses to itself (that had previously been rock solid <1ms) showed odd behaviour. In fact some of the ping responses showed negative response times! How something can respond to a ping before it's even been sent I'm not quite sure. Either that or we've figured out time travel!
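If you want to run the same self-ping test without staring at the console, here is a rough Python sketch that scans saved ping output for the two symptoms we saw: big spikes and negative times. The sample text, the IP address and the 100ms threshold are made up for illustration and are not taken from our logs; it also assumes the usual Windows reply format with a "time=" or "time<" field.

```python
import re

# Matches the time field in Windows ping replies,
# e.g. "time<1ms", "time=625ms" or "time=-14ms".
TIME_FIELD = re.compile(r"time([=<])(-?\d+)ms")

def parse_ping_times(output):
    """Return reply times in ms; 'time<1ms' is recorded as 0."""
    times = []
    for sign, value in TIME_FIELD.findall(output):
        times.append(0 if sign == "<" else int(value))
    return times

def suspicious_samples(times_ms, spike_ms=100):
    """Return (index, time) pairs that are negative or at/above spike_ms."""
    return [(i, t) for i, t in enumerate(times_ms) if t < 0 or t >= spike_ms]

# Illustrative sample only (hypothetical address and values).
sample = """\
Reply from 192.168.0.10: bytes=32 time<1ms TTL=128
Reply from 192.168.0.10: bytes=32 time=625ms TTL=128
Reply from 192.168.0.10: bytes=32 time<1ms TTL=128
Reply from 192.168.0.10: bytes=32 time=-14ms TTL=128
"""

times = parse_ping_times(sample)
print(times)
print(suspicious_samples(times))
```

On a healthy LAN you would expect the second list to stay empty; anything negative in there is a strong hint that the guest's timer is misbehaving rather than the network.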
Following the article's recommendation we added the /usepmtimer switch to the boot.ini of the guest 3CX server and rebooted, and the problem seems to have disappeared. We finally got the system back on an even keel at around 5pm, by which time my nerves were shredded and my Christmas bonus shrunken!
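For anyone unsure where the switch goes: /usepmtimer is appended to the end of the OS entry line under [operating systems] in boot.ini. The Python sketch below shows the edit on a piece of text; the example boot.ini contents are a typical layout and not a copy of ours, and the helper name add_usepmtimer is just something I made up for the illustration. It only transforms a string, so you would still need to edit the real file yourself (it is normally hidden and read-only) and reboot.

```python
def add_usepmtimer(boot_ini_text):
    """Append /usepmtimer to each OS entry in boot.ini text that lacks it."""
    out = []
    in_os_section = False
    for line in boot_ini_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # Track whether we are inside the [operating systems] section.
            in_os_section = stripped.lower() == "[operating systems]"
        elif (in_os_section and "=" in stripped
              and "/usepmtimer" not in stripped.lower()):
            line = line.rstrip() + " /usepmtimer"
        out.append(line)
    return "\n".join(out)

# Typical-looking boot.ini, for illustration only (not our actual file).
example = """[boot loader]
timeout=30
default=multi(0)disk(0)rdisk(0)partition(1)\\WINDOWS
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\\WINDOWS="Windows Server 2003, Standard" /noexecute=optout /fastdetect"""

print(add_usepmtimer(example))
```

Running it twice gives the same result as running it once, so the switch is never added twice to the same entry.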
The behaviour we saw was a result of the CPUs being in sync after a reboot but gradually drifting further and further apart.
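To picture why drift gives both huge and negative ping times, here is a deliberately simplified Python model. It assumes each CPU has its own counter, that a ping's start timestamp can be read on one CPU and its end timestamp on another, and that CPU 1 drifts ahead of CPU 0 at a steady rate. The drift rate and the function names are invented for the example; this is not how Hyper-V or Windows actually implement timing, just the arithmetic of comparing two clocks that disagree.

```python
def offsets_after(minutes, drift_ms_per_min=2.5):
    """Per-CPU clock offsets in ms; CPU 0 is the reference, CPU 1 drifts ahead.

    The drift rate is a made-up number for illustration.
    """
    return [0.0, minutes * drift_ms_per_min]

def measured_ms(true_ms, start_cpu, end_cpu, offsets_ms):
    """Elapsed time as seen when start and end are read from different counters."""
    return true_ms + offsets_ms[end_cpu] - offsets_ms[start_cpu]

# A ping that really takes 0.5 ms, measured just after reboot and later on.
for minutes in (0, 4, 60):
    offs = offsets_after(minutes)
    same_cpu = measured_ms(0.5, 0, 0, offs)      # both timestamps on CPU 0
    cpu0_to_cpu1 = measured_ms(0.5, 0, 1, offs)  # inflated by the skew
    cpu1_to_cpu0 = measured_ms(0.5, 1, 0, offs)  # can go negative
    print(minutes, same_cpu, cpu0_to_cpu1, cpu1_to_cpu0)
```

Right after a reboot all three measurements agree, which matches what we saw; as the offset grows, the cross-CPU measurements swing further above and below the true value while the same-CPU measurement stays put.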
I hope people can learn from my experience: be sure to enter this switch if you are planning on using Server 2003 R2 as a guest OS for a multi CPU virtual machine on 2008 Hyper-V. After all the stress it was good to finally get to the root cause, and it's probably one we would never even have noticed had the platform not been running a time critical application such as 3CX, but there have got to be less stressful ways of learning these things! :lol: