[Freeswitch-users] registration fails after several hours - FS problem?

Mario G mario_fs at mgtech.com
Sun Nov 7 12:22:28 PST 2010


A final update for those interested. Anthony worked many hours last Mon-Wed (11/2-3) and solved this issue. I leave it up to him if he wants to explain/add comments. I thoroughly tested FS twice for 48 hours and two other times for shorter periods. Not once did it fail to register, even after an ISP upgrade knocked the line out 7 times one night. FS is recovering from lost connections and reregistering perfectly every time as far as I can tell. The fix was applied to git on 11/3.
MANY THANKS TO ANTHONY!
Mario G

On Nov 2, 2010, at 5:01 PM, pbdlists at pinboard.com wrote:

> Hello Anthony,
> 
> Indeed, even though I can't understand most of what is going on in this
> debug output, it is helpful. Here is what I found (I still have the traces,
> but can't put them anywhere public):
> 
> - just before the registrations go away, freeswitch says it got an ICMP
>  type 3 code 1 (no route to host):
>     tport_wakeup_pri(0x7f9030004520): events ERR
>     tport_udp_error: No route to host (113) [icmp type=3 code=1]
>           reported by [188..........9]:0
>     nta_agent: tport: 88........1:1024: No route to host
> - a tcpdump, however, shows no such ICMP packet
> - routes are static and a dump of the routing table every 5 seconds
>  shows that the default route (used for these destinations) is there
> - some more testing, capturing and searching shows a very interesting
> behaviour:
>  - none of the network interfaces used to communicate to anywhere
>    outside of the box show the ICMP reported by freeswitch, neither
>    the external facing interface, nor the internal facing interface
>  - BUT:
>    - on the loopback interface I do get ICMP type 3 code 1 (host
>      unreachable) messages
>    - the ICMP messages I see there are only for systems with which
>      freeswitch is communicating
>    - the ICMP messages I see there are exactly for the remote systems
>      which were reported down (plus for one internal registration)
>    - the timestamps of the ICMP messages starting and the registrations
>      going down match, as well as the timestamps of the ICMP messages
>      stopping and the registrations coming up again
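
For anyone reading similar traces, the tport_udp_error line quoted above carries both the errno and the ICMP type/code. Below is a minimal Python sketch for pulling those fields out of such a line. The function name parse_tport_error and the small lookup table are hypothetical, made up for illustration (the code meanings come from RFC 792); nothing here ships with FreeSWITCH or Sofia, and the regex assumes the line looks exactly like the one in the trace.

```python
import re

# ICMP type 3 is "Destination Unreachable" (RFC 792).
# Only the first few codes are listed; this table is illustrative, not complete.
ICMP_UNREACH_CODES = {
    0: "network unreachable",
    1: "host unreachable",
    2: "protocol unreachable",
    3: "port unreachable",
}

# Assumed line shape, taken from the trace quoted in this thread:
#   tport_udp_error: No route to host (113) [icmp type=3 code=1]
LINE_RE = re.compile(
    r"tport_udp_error: (?P<msg>.+?) \((?P<errno>\d+)\) "
    r"\[icmp type=(?P<type>\d+) code=(?P<code>\d+)\]"
)


def parse_tport_error(line):
    """Return the errno and ICMP fields from one log line, or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    icmp_type = int(m.group("type"))
    icmp_code = int(m.group("code"))
    if icmp_type == 3:
        meaning = ICMP_UNREACH_CODES.get(icmp_code, "unknown code")
    else:
        meaning = "not a destination-unreachable message"
    return {
        "message": m.group("msg"),
        "errno": int(m.group("errno")),
        "icmp_type": icmp_type,
        "icmp_code": icmp_code,
        "meaning": meaning,
    }
```

Feeding it the line from the trace gives errno 113 with ICMP type 3, code 1, which the table labels "host unreachable". Running it over a whole log would make it easier to line up the error times with the ICMP packets seen on the loopback capture.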
> 
> I didn't change much on the default config and as far as I know nothing
> network related. Is it possible that I nevertheless messed up my config
> somewhere, causing freeswitch to choose the loopback interface for
> communicating from time to time?
> 
> Cheers,
> 
> Kurt
> 
> On Mon, Nov 01, 2010 at 06:59:27PM -0500, Anthony Minessale wrote:
>> what would help is if you can get a similar log with the siptrace on
>> the profile and sofia global loglevel 9
>> The key is to catch the very first time it goes wrong, possibly a
>> full pcap of any network activity as well to look for more clues.
>> 
>> This appears to be some sort of strange environmental condition or
>> particular edge case that breaks the sip lib internally.
>> 
>> 
>> 
>> On Mon, Nov 1, 2010 at 5:51 PM,  <pbdlists at pinboard.com> wrote:
>>> Just a quick note. Mario mentions he only sees the problems on osX. I see
>>> exactly the same errors and warnings in my logs on a Linux box (Fedora 12 64-bit).
>>> Sometimes it happens every couple of minutes, sometimes it goes away for 2-3
>>> hours.
>>> 
>>> The excerpt attached is from a log of a freshly compiled git checkout. What I see
>>> is that if it happens, usually multiple external registrations go down, not just
>>> one or just the registrations with one server/provider.
>>> 
>>> Cheers,
>>> 
>>> Kurt
>>> 
>>> On Sun, Oct 31, 2010 at 12:24:35PM -0700, Mario G wrote:
>>>> I have the pcap and dump to email to you and lots of new info on this serious bug (yes it's a bug on FS for osX). The pcap is 1.1M and dump is 350M, please tell me where to send them. I don't want to put them in public areas since they contain security info. Please review my steps below. I don't know FS or Linux internals but it seems a lot like a timing issue where two processes are not communicating with each other since retry messages occur but there is no SIP tracing going on. THANKS SO MUCH!
>>>> 
>>>> LINUX
>>>> 1. Set up FS on OpenSuse starting Sep 15. After basic initial problems there was a serious nat/upnp problem that lasted 3 weeks. Fixed with help, but still used nat.
>>>> 2. Final testing was on git 2010-10-13. Ran fine for 5 days on very old 32 bit system.
>>>> 
>>>> OSX
>>>> 3. Purchased Mac Mini and installed FS git 2010-10-23. Lasted only 3 to 17 hours. Problems looked same as nat so switched to full static.
>>>> 4. With all static (-nonat) and only one DSL static connection active ITSPs go down in 5-60 minutes one by one. Still thought it was network related. Sent you traces.
>>>> 5. Updated to git 10-29 but made no difference.
>>>> 
>>>> LINUX
>>>> 6. Went back to the Linux box with git 10-13 using copy of config from mac. Pure static as osX. No problems for 6 hours!
>>>> 7. Copied and updated Linux to git 10-29 to be the same as Mac box. Again, no problems for 12 hours!
>>>> 
>>>> OSX
>>>> 8. Went back to the mac to provide you with pcap and dump. In about 15 minutes FS lost 2 ITSPs. Here are the messages issued during pcap/dump. NOTE the clock message, which is the first I have seen of it:
>>>> 
>>>> 2010-10-31 11:35:00.593970 [WARNING] sofia_reg.c:387 idone Failed Registration, setting retry to 15 seconds.
>>>> 2010-10-31 11:35:13.118634 [NOTICE] sofia_reg.c:342 Registering idtwo
>>>> 2010-10-31 11:35:16.432236 [NOTICE] sofia_reg.c:342 Registering idone
>>>> 2010-10-31 11:35:19.898319 [CRIT] switch_time.c:760 Forward Clock Skew Detected!
>>>> 2010-10-31 11:35:25.440207 [WARNING] switch_scheduler.c:114 Task was executed late by 2 seconds 1 heartbeat (core)
>>>> 2010-10-31 11:35:29.946329 [WARNING] sofia_reg.c:387 idtwo Failed Registration, setting retry to 15 seconds.
>>>> 2010-10-31 11:35:32.147466 [WARNING] sofia_reg.c:387 idone Failed Registration, setting retry to 15 seconds.
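
To see at a glance which gateways are dropping and when, log lines like the ones above can be tallied with a few lines of Python. This is only a sketch under assumptions: the regex is modelled on the sofia_reg.c warning format shown in this excerpt, and the names FAIL_RE and failed_registrations are hypothetical, invented for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed format, copied from the excerpt in this thread:
#   2010-10-31 11:35:00.593970 [WARNING] sofia_reg.c:387 idone Failed Registration, ...
FAIL_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) \[WARNING\] "
    r"sofia_reg\.c:\d+ (?P<gw>\S+) Failed Registration"
)


def failed_registrations(lines):
    """Return (events, counts) for every failed-registration warning.

    events is a list of (timestamp, gateway) tuples in input order;
    counts maps each gateway name to its number of failures.
    """
    events = []
    for line in lines:
        # Drop any leading mail-quote markers ("> ", ">>>> ") before matching.
        m = FAIL_RE.match(line.lstrip("> "))
        if m is None:
            continue
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
        events.append((ts, m.group("gw")))
    counts = Counter(gw for _, gw in events)
    return events, counts
```

Called on the seven lines above, it skips the NOTICE, CRIT and scheduler lines and reports two failures for idone and one for idtwo. The timestamps can then be compared against a packet capture taken over the same period.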
>>>> 
>>>> I found the instructions for PCAP and TCPDUMP here in case you need them:
>>>> http://support.apple.com/kb/HT3994
>>>> http://www.osxbook.com/book/bonus/chapter8/core/
>>>> 
>>>> Note: I had the Mini set to no sleep even though it worked with Linux sleep. I found a couple of others on the web who had the same problem and one had written a script to restart FS every 4 hours. Fried (tired) right now and can't find the URL but it was from Jan 2010.
>>>> 
>>>> One last thing to mention is that on osX using auto-nat:1.2.3.4 and some expiry parms, etc that may have triggered activity, FS worked much longer than on static. This is why I think it's timer or sync related and only on osX.
>>>> 
>>> 
>> 
>> 
>> -- 
>> Anthony Minessale II
>> 
>> FreeSWITCH http://www.freeswitch.org/
>> ClueCon http://www.cluecon.com/
>> Twitter: http://twitter.com/FreeSWITCH_wire
>> 
>> AIM: anthm
>> MSN:anthony_minessale at hotmail.com
>> GTALK/JABBER/PAYPAL:anthony.minessale at gmail.com
>> IRC: irc.freenode.net #freeswitch
>> 
>> FreeSWITCH Developer Conference
>> sip:888 at conference.freeswitch.org
>> googletalk:conf+888 at conference.freeswitch.org
>> pstn:+19193869900
>> 
> 
> -- 
