Solved Trunk Keep-Alive stopped working

Discussion in '3CX Phone System - General' started by JST, Mar 14, 2018.

Thread Status:
Not open for further replies.
  1. JST

    JST New Member

    Joined:
    Jan 8, 2017
    Messages:
    107
    Likes Received:
    1
    We are suddenly encountering 403 Forbidden errors on outgoing (and potentially incoming) calls. When I called our service provider (Broadvoice/PhonePower), they found that our system is no longer sending keep-alive requests. I believe this problem started with the recent upgrade to 15.5.9348.3 (Linux).

    When I checked the trunk configuration, I was surprised to see that it doesn't even have an option to configure a keep alive value (e.g. 60 seconds).

    Basically, the registration with the provider eventually fails resulting in us no longer being able to make phone calls (or receive calls).

    Does anybody have the same problem? Any solution?
     
  2. YiannisH_3CX

    YiannisH_3CX Support Team
    Staff Member 3CX Support

    Joined:
    May 10, 2016
    Messages:
    4,864
    Likes Received:
    320
    Hello @JST

    Please note that by default we do not send keep alive requests to providers and we never did. In fact Broadvoice should send keep alive messages to you. Perhaps they meant your system does not reply? As you can see the originator of the Keep Alive is Broadvoice. 3CX replies back with 200 OK.
    [Attached image: 2018-03-15_12h00_31.png]
    403 forbidden sounds like the number you are presenting is not correct or your public IP changed and they are seeing invites from a wrong IP.
     
  3. JST

    JST New Member

    Joined:
    Jan 8, 2017
    Messages:
    107
    Likes Received:
    1
    Hi @YiannisH_3CX

    Your explanation makes sense. I was assuming that the problem is on their end. That's why I called them first.

    So, let's assume that my system doesn't reply to their request. Would that be a firewall problem or could there be an issue with 3CX not responding? The number is definitely valid and my IP (phone endpoint) rarely changes (every couple of months?).

    I am using Cisco Meraki MX to secure our home. It is using a VPN connection to a remote site hosting the phone system. That site is using a Sophos UTM with a static IP. There has been no change on the remote site, but there was a firmware upgrade on the local Meraki MX. I can try a packet capture on the MX, but since it is just using a simple VPN to connect to the remote location, it is doubtful that it is a firewall problem.
     
    #3 JST, Mar 15, 2018
    Last edited: Mar 15, 2018
  4. JST

    JST New Member

    Joined:
    Jan 8, 2017
    Messages:
    107
    Likes Received:
    1
    OK. I have called Broadvoice / PhonePower again today and they did a packet capture for me. As part of this exercise, support has asked me to increase my re-register time to 3600 which I did. They also confirmed that their system receives the correct requests and that my system confirms the keep alive.

    This worked for a few hours, but now I am back to the same problem and every time the only option is to reboot the 3CX system.

    I also checked on my end and found that there has been no firmware upgrade and configuration change on the Sophos appliance in front of the 3CX appliance. Broadvoice / PhonePower also confirmed that there has been no change on their end. Looking at their log when the process failed, they confirmed that 3CX was no longer registered on their service. Apparently, my system thinks otherwise and so it is attempting to make the call resulting in a 403 message.

    What has changed? I have checked pretty much everything and the only thing that actually changed is 3CX itself when I installed the most recent upgrade about 2-3 weeks back. Basically, it seems that the patch did something to the registration feature that makes it stop working after a certain time. Since this apparently doesn't happen for everyone, I am guessing that there might be other factors at play.

    Any suggestions?
     
  5. YiannisH_3CX

    YiannisH_3CX Support Team
    Staff Member 3CX Support

    Joined:
    May 10, 2016
    Messages:
    4,864
    Likes Received:
    320
    First of all the supported Broadvoice service is IP based so no registration is required. Do you have a register based Broadvoice trunk? If so please note that this has not been tested from our side and we cannot confirm the configuration details.
    Having said that, you should reduce the re-registration time to a normal value. Having the value set to 3600 means that if the connection is lost within that hour the PBX will not be aware of it.

    Now, to troubleshoot your issue you will need to run a Wireshark capture on the PBX server at the time of the failure and leave it running until re-registration should occur. Note that it is the provider that sets when the re-registration will take place, by accepting the value you are sending or suggesting a new value in their 200 OK message. Again, I am not aware which is the case with Broadvoice as we are using an IP based trunk. Using the capture you will be able to determine what is happening.
    Also what is the registrar you are using to connect to Broadvoice?
     
  6. JST

    JST New Member

    Joined:
    Jan 8, 2017
    Messages:
    107
    Likes Received:
    1
    Thank you for your reply.

    PhonePower is owned by Broadvoice. It handles their private phone services in the US. My 3CX server is sitting in a hosted environment on a 100/100 connection. It handles the business calls for my company using another provider (no issues there). Since my company is fairly small, I am also using the 3CX system to serve my home which is connected to the hosted environment using a VPN connection (24x7) on a business internet connection.

    PhonePower support told me that their default value for re-registration is 3600. I agree that this is quite high and understand that there might be a disconnect for an hour if that connection fails. We previously had this set to 600. Would you recommend a lower value?

    When we did a packet capture we can see all relevant messages (eg. 200 OK message) and support confirmed that everything looks good when it initially registers. Once registered, re-registration also seems to work until our 3CX system eventually fails to re-register. Support confirmed that by checking their log and not seeing a re-register request. So, it seems that our 3CX box works for a while and then it kind of goes to "sleep"...

    On the connection side, the internet connection is provided by the data center, the DNS information for the 3CX system is handled by GoDaddy and the whole environment is secured by Sophos SG appliance. Our 3CX system is Linux based and it is sitting on a VMware server.

    I am also quite familiar with the setup and configuration of firewall and PBX hardware. So, we already did the packet capture.

    What we are seeing is that our 3CX system fails to re-register at the agreed upon time. While this doesn't always happen it lately happens at least once in a 24 hour period. I also found that the 3CX system shows the trunk as registered even though it no longer is.

    One more thing: Their support was wondering why 3CX doesn't send any re-registers during an active call. I mean you are obviously registered during a call, but I don't know what the standard is for that. It seems that the support person was wondering what would happen with the re-register timer during an active call, since this seems more likely to happen if there are more calls.

    Do you know how that part works? I am basically wondering how re-registration works. Here are some of my questions and assumptions:
    - Re-registration is set to 600 and takes place 2 minutes prior to the first call
    - The first active call takes 4 minutes, followed by a second call taking 30 minutes
    - When is re-registration expected to happen? Every 600 seconds regardless of call in progress or 600 seconds after any active call ends? If it is the first option, then our 3CX system has an issue because based on the packet capture, it seems to do the second option.

    It should be noted that anybody using 3CX with a low re-register value might not notice the disconnect and some providers might also be more relaxed on the registration process.

    I hope this makes sense. While I am familiar with the configuration of PBX systems, I am not an expert when it comes to the actual protocols and services.

    Please let me know if there are any other logs that might be useful to debug this issue.
     
  7. lneblett

    lneblett Well-Known Member

    Joined:
    Sep 7, 2010
    Messages:
    2,063
    Likes Received:
    57
    A few items -

    Something is amiss. To Yiannis's points -
    Is the trunk a peer trunk or a register trunk?
    If a peer trunk, then the trunk status will always show green (registered).
    If a register trunk, then it depends. When 3CX sends the register to the provider, there may be an expiry header/request. This is the time period that 3CX is requesting that be allowed before the next registration occurs. The provider can elect to accept the requested expiry or enforce their own as they may want to do this in order to prevent those that might think it needed to register every minute thereby causing unwarranted traffic. Now then, once the register period has been agreed upon and established, neither side is expected to contact the other until such time as the expiry occurs. In essence, if nothing else occurs in the meantime, you could literally take the connection down and neither side would know, 3CX would show green, and both would think the other to still be available. Only when a message was sent from one or the other and no response was received would the issue come to light.

    Calls and registrations are independent of one another. A call starts with an INVITE and is only sent when the registration is thought to be active if using a register trunk. If not registered or no response, the call will not be sent and depending on the arrangement, the provider may forward to a disaster recovery number, send to the provider's voice mail for the account or provide the caller with some indication that the call cannot be completed. The actual message is the one provided by the caller's carrier and how they see fit to interpret the reason code. If a peer, the call will be sent anyway and the call will either timeout waiting for a response or will receive the response and react accordingly. If a timeout, then the same handling should be done by the provider as that for no response above.

    A REGISTER is merely a request to let the server know that the PBX is there, how to contact it and when a new REGISTER is expected. The same happens when a phone registers to 3CX. Once done and accepted, then everyone can rest easy for awhile and not clog the networks with extraneous data.

    It would be helpful to post the same captures (sans any public IP info or other sensitive data) so that we can see what the messaging is doing or not.
     
    JST and YiannisH_3CX like this.
  8. YiannisH_3CX

    YiannisH_3CX Support Team
    Staff Member 3CX Support

    Joined:
    May 10, 2016
    Messages:
    4,864
    Likes Received:
    320
    Please note that what you set in the PBX as re-registration timeout makes no difference if the provider does not accept this value. You could set it to 5 seconds but if the provider has a set value of 3600 that is what will be respected.
    You will need to clarify how the provider works. Do they accept whatever is sent to them? Do they have a set value or do they increment the value, adding to the re-registration time?
    Some providers start from a lower value to see if you are registering reliably to them and slowly increase the re-registration value to a higher number once they establish that the PBX is reliably re-registering on time.
    3CX PBX will re-register to the provider at 90% of the registration value. So if the accepted / agreed-upon time is 3600, the PBX will re-register after 3240 seconds.
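
    As a rough illustration of that arithmetic (a sketch only, not 3CX's actual code; the function name and the percent parameter are made up for this example):

```python
def reregister_delay(accepted_expires: int, percent: int = 90) -> int:
    """Seconds to wait before the next REGISTER, given the accepted Expires value.

    Assumption: the refresh fires at a fixed percentage of the accepted
    expiry (90% per the statement above). Integer arithmetic avoids
    floating-point rounding surprises.
    """
    if accepted_expires <= 0:
        raise ValueError("accepted_expires must be positive")
    return accepted_expires * percent // 100


if __name__ == "__main__":
    for expires in (3600, 600, 60):
        print(expires, "->", reregister_delay(expires))
```

    The input should be the value the provider accepted in its 200 OK, not the value configured in the PBX.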

    Having an active call going does not affect the registration of the PBX. If during a call the PBX needs to re-register to the provider it will without that affecting the ongoing call.
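
    To picture this with the scenario from the earlier question (re-registration set to 600, a 4 minute call followed by a 30 minute call), here is a small sketch of which refreshes would be expected to land in the middle of a call. The 60 second gap between the two calls and all names are assumptions for illustration only:

```python
from typing import List, Tuple


def refreshes_during_calls(
    refresh_interval: int,
    calls: List[Tuple[int, int]],
    horizon: int,
) -> List[Tuple[int, bool]]:
    """List expected refresh times (seconds after the initial REGISTER)
    and whether each one falls inside an active call.

    The registration timer is modelled as independent of call state,
    so refreshes are simply spaced refresh_interval apart.
    """
    schedule = []
    t = refresh_interval
    while t <= horizon:
        in_call = any(start <= t < end for start, end in calls)
        schedule.append((t, in_call))
        t += refresh_interval
    return schedule


if __name__ == "__main__":
    # Registration at t=0, first call starts 2 minutes later and lasts
    # 4 minutes; the second call (30 minutes) starts after an assumed
    # 60 second pause. 540 is 90% of an accepted expiry of 600.
    calls = [(120, 360), (420, 2220)]
    for when, in_call in refreshes_during_calls(540, calls, horizon=2400):
        print(f"t={when:>5}s  during call: {in_call}")
```

    In this example every expected refresh falls inside the second call, so a capture covering that call should still show REGISTER messages leaving the PBX.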

    To troubleshoot your issue you will first need to establish what is the agreed re-registration value.
    Start a capture and re-register your trunk. The PBX will send a register message with the set value in the PBX. The provider will then answer with the accepted value in the 200 OK message.

    In the example below the PBX suggests an Expires value of 3600. The provider replies with an Expires value of 60. The PBX will re-register to the provider after 54 seconds.

    Code:
    Session Initiation Protocol (REGISTER)
        Request-Line: REGISTER sip:callcentric.com:5060 SIP/2.0
        Message Header
            Via: SIP/2.0/UDP 192.168.xxx.xx:5060;branch=z9hG4bK-524287-1---e77cf26ef6690e7c;rport
            Max-Forwards: 70
            Contact: <sip:1777xxxxxx@158.xx.xx.xx:5060;rinstance=8990e7621c31146>
            To: <sip:1777xxxxxx@callcentric.com:5060>
            From: <sip:1777xxxxxx@callcentric.com:5060>;tag=f37b8a78
            Call-ID: 8OWfv0yvz_XtyCrTIo1WPg..
            CSeq: 2 REGISTER
            Expires: 3600
            Allow: INVITE, ACK, CANCEL, OPTIONS, BYE, REGISTER, SUBSCRIBE, NOTIFY, REFER, INFO, MESSAGE, UPDATE
            Supported: replaces, timer
            User-Agent: 3CXPhoneSystem 15.5.9348.3 (8713)
            Content-Length: 0
    
    Session Initiation Protocol (200)
        Status-Line: SIP/2.0 200 Ok
        Message Header
            v: SIP/2.0/UDP 192.168.xxx.xx:5060;branch=z9hG4bK-524287-1---e77cf26ef6690e7c;rport=5060;received=158.xx.xx.xx
            f: <sip:1777xxxxxxx@callcentric.com:5060>;tag=f37b8a78
            t: <sip:1777xxxxxx@callcentric.com:5060>
            i: 8OWfv0yvz_XtyCrTIo1WPg..
            CSeq: 2 REGISTER
            m: <sip:1777xxxxxx@158.xx.xx.xx:5060;rinstance=8990e7621c31146>;expires=60
            l: 0
    Once you establish the correct value you will then have to wait for the re-registration and see if that is correct. I would recommend running the captures from the PBX to be able to see if the messages are leaving / reaching the server as there might be something else in the network blocking the re-registration and the messages might indeed not be reaching the provider.
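
    If you export the 200 OK as plain text, a small helper along these lines can pull out the accepted value. This is only a sketch under a few assumptions: the function name is made up, only the first Contact binding is looked at, and the expires parameter on the Contact (compact form "m") is preferred over a separate Expires header, which is how SIP registrations are normally read:

```python
import re
from typing import Optional

# Compact SIP header forms seen in the capture above, mapped to long names.
COMPACT_NAMES = {"m": "contact", "v": "via", "f": "from", "t": "to",
                 "i": "call-id", "l": "content-length"}


def accepted_expires(response_text: str) -> Optional[int]:
    """Return the expiry (seconds) granted in a REGISTER 200 OK, or None."""
    contact_expires = None
    header_expires = None
    for line in response_text.splitlines():
        name, sep, value = line.strip().partition(":")
        if not sep:
            continue  # status line or blank line
        name = name.strip().lower()
        name = COMPACT_NAMES.get(name, name)
        if name == "contact" and contact_expires is None:
            match = re.search(r";\s*expires\s*=\s*(\d+)", value, re.IGNORECASE)
            if match:
                contact_expires = int(match.group(1))
        elif name == "expires" and value.strip().isdigit():
            header_expires = int(value.strip())
    return contact_expires if contact_expires is not None else header_expires


if __name__ == "__main__":
    sample = (
        "SIP/2.0 200 Ok\n"
        "CSeq: 2 REGISTER\n"
        "m: <sip:1777xxxxxx@158.xx.xx.xx:5060;rinstance=8990e7621c31146>;expires=60\n"
        "l: 0\n"
    )
    print(accepted_expires(sample))
```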
     
    JST likes this.
  9. JST

    JST New Member

    Joined:
    Jan 8, 2017
    Messages:
    107
    Likes Received:
    1
    Thank you Yiannis!

    This is exactly the information I was looking for. I also have some feedback for you.

    This is not what the provider is seeing. They have confirmed that they are accepting a large range of values and seem to have a preference for higher values. That said, their technician confirmed that according to his logs the re-register appears to be related to making a call. We also tried a lower value and didn't see any re-registration attempts during the call (e.g. 120 with me being on the phone for 10 minutes).

    On the other hand, if there is no call, the process seems to work just fine. It also doesn't happen every time I am making a call. Quite frankly, I don't know what triggers it, but it happens fairly consistently within a 24-hour period and is almost guaranteed if I am making 2-3 calls in a row (with the next one ending up as 403).

    I am guessing that we need more data to get to the bottom of this. So, I will check if the Sophos appliance can do a capture for a 24 hour period or I will try to install wireshark on another vm. I will also look for any blocked traffic, but I believe that the re-registration is pretty much the same as initial registration.

    The provider has also confirmed that they are seeing the 200 message. In fact, they are seeing all required messages except the re-register.

    I will try to put more information later on this week or early next week.

    Thank you for your detailed information and feedback!
     
  10. JST

    JST New Member

    Joined:
    Jan 8, 2017
    Messages:
    107
    Likes Received:
    1
    Thank you for your detailed information. This is definitely a register trunk.

    As I said before, the provider sees the initial register and also re-registers, but apparently the re-registers stop at some point in time.

    Once that happens, my phone sends an invite (seen by the provider) and it is answered with a 403 because, based on their status, my system has missed one (or more) re-registers. So, the call fails because my system isn't registered anymore. I was assuming that 3CX would change the status of the trunk to red if a re-register failed. Yet, it continuously shows green.

    Another interesting fact: If I don't reboot or simply re-save the trunk, the trunk will eventually register again (less than 30 minutes?).

    Anyhow, like I said in my reply to Yiannis, I understand that we need more data to figure this out and since it doesn't look like the provider can send me their log, I will take a capture on my end.

    Just wondering what it will mean if we do find missing re-registers in that capture?
     
  11. YiannisH_3CX

    YiannisH_3CX Support Team
    Staff Member 3CX Support

    Joined:
    May 10, 2016
    Messages:
    4,864
    Likes Received:
    320
    Try to run a capture, preferably from the 3CX server, and p.m. me if you need any assistance with troubleshooting. Without a capture displaying the issue it is very hard to troubleshoot this.
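
    Once you have the capture, something like the sketch below can flag the windows in which the registration would have lapsed. It assumes you have already extracted the timestamps (in seconds) of the REGISTER requests that were answered with 200 OK, and that you know the accepted Expires value. The function and parameter names are made up for this example:

```python
from typing import List, Tuple


def find_registration_gaps(
    register_times: List[float],
    accepted_expires: int,
    capture_end: float,
) -> List[Tuple[float, float]]:
    """Return (expired_at, next_register_or_capture_end) windows during
    which no valid registration existed.

    Assumption: every timestamp is a REGISTER that the provider
    accepted with the same expiry.
    """
    times = sorted(register_times)
    gaps = []
    for current, following in zip(times, times[1:] + [capture_end]):
        expired_at = current + accepted_expires
        if following > expired_at:
            gaps.append((expired_at, following))
    return gaps


if __name__ == "__main__":
    # Refreshes every 540 s, then nothing between t=1080 and t=2400.
    times = [0, 540, 1080, 2400]
    print(find_registration_gaps(times, accepted_expires=600, capture_end=2500))
```

    Any call attempted inside one of the reported windows would be expected to be rejected by the provider even if the PBX still shows the trunk as green.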
     
  12. John Kirkwood

    Joined:
    Mar 20, 2018
    Messages:
    4
    Likes Received:
    3
    Hi @JST, @YiannisH_3CX, and @lneblett.
    This is really for JST to let him know his problem is not a one off. Obviously I found this thread after a web search for my issue.
    I am having a similar issue with a newly installed 15.5 on Linux.
    We use Gamma (UK) which is definitely IP based so one would think this should not happen.
    However here it is:
    We got reports that intermittently users could not make outgoing calls. After dialling the number the call "hangs" for a long time before the error "No response" appears on screen.
    I discovered that if the 3cx system/Gamma trunk is idle for 5 minutes with no incoming or outgoing calls, somehow the 3CX is no longer "registered" even though registration should not be required.
    I can now recreate this fault at will and without fail the fault occurs after 5 minutes of inactivity.
    I really have passed hundreds of calls to test this. As long as these calls are made every one, two, three or four minutes the calls go through. If a gap of five minutes passes all outbound calls start to "hang".
    Here is the funny thing though - there is no problem with inbound calls. Obviously Gamma are just sending calls to my public IP so the 3CX is receiving no problem, but if an inbound call presents while the fault is apparent the system seems to "find" Gamma again and outbound calls will now connect fine.
    Would you believe my (clumsy but effective) workaround has been to set a Grandstream GXP2160 to dial a busy number and have it auto redial every 3 minutes. This keeps the Gamma trunk "alive".
    I will post if I get a solution.
     
    JST likes this.
  13. YiannisH_3CX

    YiannisH_3CX Support Team
    Staff Member 3CX Support

    Joined:
    May 10, 2016
    Messages:
    4,864
    Likes Received:
    320
    Hello @John Kirkwood

    I don't think you have the same issue as JST. I suspect that your issue is firewall related: since Gamma is IP based, there is no registration and there are no keep-alive messages, so after a period of inactivity the firewall blocks the connection.
    If you are familiar with wireshark make an outbound call while calls are working and run a capture on the PBX server.
    Then wait for 5 minutes (while the system is inactive) and repeat the same call while capturing.
    See if the PBX sends anything different. If the PBX is sending the same info during both calls then your issue is further down the line.
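
    If you copy the two INVITE messages out of the capture as plain text, a quick comparison along these lines can help. It is only a sketch: the names are made up, headers that differ on every call (Call-ID, CSeq, Content-Length, tag and branch parameters) are ignored, and if a header appears more than once only the last occurrence is compared:

```python
import re
from typing import Dict, Optional, Tuple

# Headers that are expected to change on every call.
SKIP = {"call-id", "cseq", "content-length"}


def parse_headers(message: str) -> Dict[str, str]:
    """Parse the header section of a SIP message into a dict."""
    headers = {}
    for line in message.splitlines()[1:]:  # skip the request line
        if not line.strip():
            break  # blank line ends the headers
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


def strip_volatile(value: Optional[str]) -> Optional[str]:
    """Drop per-transaction tag/branch parameters before comparing."""
    if value is None:
        return None
    return re.sub(r";(tag|branch)=[^;>\s]+", "", value)


def diff_invites(first: str, second: str) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return headers whose stable part differs between two INVITEs."""
    a, b = parse_headers(first), parse_headers(second)
    diffs = {}
    for name in sorted(set(a) | set(b)):
        if name in SKIP:
            continue
        va, vb = strip_volatile(a.get(name)), strip_volatile(b.get(name))
        if va != vb:
            diffs[name] = (va, vb)
    return diffs
```

    An empty result suggests the PBX is sending the same thing both times, which would point further down the line.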
     
    John Kirkwood likes this.
  14. John Kirkwood

    Joined:
    Mar 20, 2018
    Messages:
    4
    Likes Received:
    3
    Thanks @YiannisH_3CX
    You are correct and -yet again - our 3CX system was not at fault.
    After your comment (and with further help from your colleagues @IliasL_3CX and @nikosT_3cx)
    I found that the problem was with our (recently firmware updated) Draytek 2860 router.
    I can provide the details if anyone is interested but in the meantime thank you again for your input.
    John
     
    YiannisH_3CX likes this.
  15. YiannisH_3CX

    YiannisH_3CX Support Team
    Staff Member 3CX Support

    Joined:
    May 10, 2016
    Messages:
    4,864
    Likes Received:
    320
    Glad we could help
     
  16. JST

    JST New Member

    Joined:
    Jan 8, 2017
    Messages:
    107
    Likes Received:
    1
    @John Kirkwood Glad you got your problem solved!

    @Yiannis I did some more research and all collected logs indicate that 3CX stops re-registering after some time. Sadly, it is still not clear what's triggering this.

    Earlier this week I did some network captures and confirmed that the issue is with my 3CX system failing to re-register with my trunk provider. It is neither delayed nor early. It simply stops working at some point in time.

    So, the problem was caused by the most recent patch. Since it doesn't happen to everyone, I am guessing that the issue is limited to Linux and, to some extent, related to the environment. Furthermore, people using IP trunks wouldn't encounter the issue, and some other users might not have noticed or are assuming that the issue is with their trunk provider.

    Anyhow, I was planning on collecting another round of logs this weekend, but I can now report that the issue is resolved.

    How did I resolve it? I upgraded to 15.5.10072.4.

    For me, as reported earlier, the issue started happening by installing the previous patch and now it got resolved by installing another patch. Considering the observation from the trunk provider and the information found in my own logs, there is enough evidence that this problem was indeed a problem with the previous patch.

    @Yiannis I am working in software development myself and so I am feeling quite confident that the issue was indeed caused by the previous patch. I would suggest that you reach out to a developer to take a look at the previous patch and compare the code with the most recent beta. Obviously, something has changed. I mean the latest beta even tells me on the dashboard that my trunk is not supported.

    Hope this will help other users!
     
  17. YiannisH_3CX

    YiannisH_3CX Support Team
    Staff Member 3CX Support

    Joined:
    May 10, 2016
    Messages:
    4,864
    Likes Received:
    320
    Hello @JST, I am glad to see that your issue has been resolved. However, I do believe that the issue was not with the previous build but with your specific environment. The reason I am saying this is that the majority of systems use a VoIP provider, and most of our supported providers use authentication-based trunks. If this were an issue with the build, I am sure that we would be overwhelmed with reports about it. Also, a large number of our test trunks are running in a Linux environment and we faced no such issue with registration. I cannot be certain what caused your issue; we would require logs and Wireshark captures to determine that. If you have captures showing the issue, then send me a p.m. so I can take a look and perhaps see what was happening.
     
  18. JST

    JST New Member

    Joined:
    Jan 8, 2017
    Messages:
    107
    Likes Received:
    1
    Well, my environment is VMware on the latest released version. I would think that's pretty common and it also limits hardware issues. I have also double checked if there was a firmware upgrade on the Sophos security appliance, but there hasn't been one for more than 6 weeks now.

    On my end, I have no reason not to believe the trunk provider that my system has stopped re-registering. Log or not, I see no reason why they would give me incorrect information.

    I also feel that it is proven that, in my case, it had to do with the previous release because the issue started upon installing it and it stopped upon upgrading.

    Since I can no longer create any wireshark captures, I can only offer to get any other logs from my system that could be helpful.

    Please let me know what logs you want and I will send them to you by PM.
     
    #18 JST, Mar 26, 2018
    Last edited: Mar 28, 2018
  19. YiannisH_3CX

    YiannisH_3CX Support Team
    Staff Member 3CX Support

    Joined:
    May 10, 2016
    Messages:
    4,864
    Likes Received:
    320
    I do see your point of view and I do understand your way of thinking. I am not trying to blame the provider, as we know the provider works properly; it is a supported and tested provider. I am just stating that if it were an issue with the previous build, all people with registration trunks would be affected. Since we have no other reports regarding this, I am assuming (since we have no logs) that the issue was specifically with your PBX / environment, perhaps a corrupted file or service that was overwritten with the upgrade.
    Since the system was upgraded and you can no longer replicate the issue, logs from a working system will not help point out the issue. If you have any captures taken before the upgrade, it would be interesting to see them in case we can determine something from there.
     
  20. JST

    JST New Member

    Joined:
    Jan 8, 2017
    Messages:
    107
    Likes Received:
    1
    I have another update. When I finally upgraded my system to Debian 9, the issue came back.

    Now, I understand that this might be coincidence and it also happened only once so far. Furthermore, I have applied the latest beta today.

    So, I am going to monitor this once more this week and if the issue is still ongoing by the end of this week, I will work on some wireshark captures. I might have to put wireshark on a separate machine since I have no experience with installing it on Debian Linux, but I am certainly willing to get to the bottom of this.

    Will report back later on this week.

    Now, the interesting question is: If it didn't get fixed by the patch, what made it stop temporarily?
     