<div dir="ltr">See latest HEAD and patch in FS-6006<div><br></div><div>It seems in some conditions the code around state transitions that is supposed to parse the signal data makes it past there.</div><div>The new 1 line patch adds a parse_all_messages in the heart of the state machine where it is executed every time the state machine sets a new running state.</div>
<div><br></div><div>I was able to lab up a bye glare scenario with sipp where I could reproduce the queued_events bug in 1 second flat with 200 simo calls.</div><div>After this patch I was able to run the same test for 20000 calls without triggering any delta in the queued_events</div>
<div><br></div><div><br></div><div><br></div><div><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Mon, Dec 9, 2013 at 9:26 PM, David Sanders <span dir="ltr"><<a href="mailto:dsanders@pinger.com" target="_blank">dsanders@pinger.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">I've dug into this more after suggestions from Anthony and others in<br>
IRC, and reproduced against 1.2.stable.<br>
<br>
I think there's a couple of leak cases, but I've got one large one which<br>
I'll go into further.<br>
<br>
Our main use case is bridging audio, and most of the leaks reference leg<br>
B. After tracking down some more SIP for leaked calls, with timestamps,<br>
I can see that the leak appears to happen if events happen nearly<br>
simultaneously on leg A and leg B. One example would be both legs saying<br>
BYE (bye glare) at the same time. Another example I saw was a CANCEL<br>
coming in on leg A at the same time as a 200 OK on leg B.<br>
<br>
My current theory is that any events in the signal data queue for leg B<br>
will get leaked, and in turn leak references to the nua_handle_t. I<br>
can't find any code which 'pumps' the signal data queue on session<br>
destroy. To prevent the leak, I think this needs to happen. The events<br>
don't need to be processed since the session is being destroyed, but<br>
they should be pumped out of the signal data queue, free'd, and their<br>
reference to the handle should be released.<br>
<br>
Logically it makes sense that most of the leaked events I observed would<br>
be 100 BYE Trying, since simultaneous BYE is probably the most likely<br>
case for the race condition to occur in.<br>
<br>
As per Anthony's suggestion, I added trace to where the events are<br>
queued, and to where they are processed. Grepping the logs I gathered<br>
from a 24 hour test run, I can see discrepancies between queued and<br>
processed, just by event types.<br>
<br>
I've created the table below to illustrate the difference in events<br>
queued vs processed that I'm seeing. I'm only using one SIP profile, and<br>
I can see in memory that 'queued_events' on the struct = 87. The table<br>
adds up to -85, so there's likely one event type I didn't notice in my<br>
trace.<br>
<br>
+------------------+--------+----------+-----------+------+<br>
| event | status | in count | out count | diff |<br>
+------------------+--------+----------+-----------+------+<br>
| nua_i_active | 200 | 86717 | 86711 | -6 |<br>
| nua_i_invite | 100 | 77054 | 77054 | -1 |<br>
| nua_i_terminated | 200 | 32588 | 32588 | 0 |<br>
| nua_i_terminated | 487 | 29156 | 29155 | -1 |<br>
| nua_i_state | 0 | 45443 | 45443 | 0 |<br>
| nua_i_state | 180 | 73921 | 73917 | -4 |<br>
| nua_i_state | 200 | 206908 | 206898 | -10 |<br>
| nua_i_state | 487 | 29156 | 29155 | -1 |<br>
| nua_r_invite | 180 | 2411 | 2411 | 0 |<br>
| nua_r_invite | 183 | 36423 | 36418 | -5 |<br>
| nua_r_invite | 200 | 32075 | 32071 | -4 |<br>
| nua_r_invite | 408 | 1543 | 1543 | 0 |<br>
| nua_r_invite | 486 | 2479 | 2479 | 0 |<br>
| nua_r_cancel | 200 | 9099 | 9099 | 0 |<br>
| nua_r_bye | 200 | 32491 | 32491 | 0 |<br>
| nua_i_bye | 100 | 52363 | 52310 | -53|<br>
+------------------+--------+----------+-----------+------+<br>
<span class="HOEnZb"><font color="#888888"><br>
- David<br>
</font></span><div class="HOEnZb"><div class="h5"><br>
<br>
On 12/6/13 12:20 PM, David Sanders wrote:<br>
> I've been trying to track down and kill a memory leak which is occurring<br>
> in mod_sofia/Sofia-SIP. The leak was first spotted under 1.2.0 and<br>
> confirmed to still exist in 1.2.14 (latest when it was tested about a<br>
> month ago).<br>
><br>
> Two JIRA bugs were submitted as part of this hunt, FS-6005 and FS-6006.<br>
> The issue improved on 1.2.0 after the patches for those two bugs, but<br>
> showed no difference on 1.2.14.<br>
><br>
> The leak only seems to occur under heavy load (i.e. production traffic),<br>
> but will produce several leaks within 24 hours of a FS restart. Gory<br>
> details follow.<br>
><br>
> Things known about the leak:<br>
> * The leaked memory is a nua_handle_t struct, and the associated<br>
> structs which are free'd properly if the nua_handle_t doesn't leak<br>
> * The box running FS is a CentOS 6.3 machine, Linux kernel version<br>
> 2.6.32-279<br>
> * Leak can be analyzed on an idle FS server using GDB + a modified<br>
> version of gdb-heap (an extension for GDB for viewing the heap)<br>
> * Once leaked memory is identified, it can be printed with GDB as<br>
> the nua_handle_t struct and data can be gleaned from it<br>
><br>
> After the patches mentioned above, 1.2.0 has several handles leaked, all<br>
> with a sofia_dispatch_event_t struct in their memory pool, indicating a<br>
> 100 BYE Trying event. Since this event was never processed (and free'd)<br>
> it never decremented the ref count on the handle. The handle is marked<br>
> as destroyed, so the ref count is the only thing keeping it around.<br>
><br>
> On 1.2.14, it acted the same prior to FS-6006's patch. After that patch<br>
> was applied, the leak mutated. The idle box lists no calls and no<br>
> channels, but the leaked handles are not marked as destroyed, have<br>
> nh_active_call=1, and have a ref count of 3.<br>
><br>
> Prior to FS-6006 it was clear with ref trace enabled for mod_sofia that<br>
> the missing unref was due to the 100 BYE Trying never being processed.<br>
> I'll have to re-enable the ref trace to see why there are now 3<br>
> references left after FS-6006.<br>
><br>
> Here's a Valgrind entry for the leaked memory (this is from a 1.2.14<br>
> instance):<br>
><br>
> ==28083== 22,980 (1,120 direct, 21,860 indirect) bytes in 4 blocks are definitely lost in loss record 448 of 453<br>
> ==28083== at 0x4C2677B: calloc (vg_replace_malloc.c:593)<br>
> ==28083== by 0xBA60CC1: su_home_new (su_alloc.c:559)<br>
> ==28083== by 0xBA0D8F5: nh_create_handle (nua_common.c:113)<br>
> ==28083== by 0xBA2AA2A: nua_handle (nua.c:315)<br>
> ==28083== by 0xB9AAD38: sofia_glue_do_invite (sofia_glue.c:2357)<br>
> ==28083== by 0xB9729CC: sofia_on_init (mod_sofia.c:108)<br>
> ==28083== by 0x5138702: switch_core_session_run (switch_core_state_machine.c:424)<br>
> ==28083== by 0x5134D8D: switch_core_session_thread (switch_core_session.c:1417)<br>
> ==28083== by 0x596C850: start_thread (in /lib64/<a href="http://libpthread-2.12.so" target="_blank">libpthread-2.12.so</a>)<br>
> ==28083== by 0x6AEB11C: clone (in /lib64/<a href="http://libc-2.12.so" target="_blank">libc-2.12.so</a>)<br>
><br>
><br>
> I've exhausted my ability to debug the issue, and could use some help<br>
> from the FS maintainers/community in killing the issue. The leaked<br>
> memory, while small, appears to cause rather large performance problems<br>
> for the FS instance if allowed to accumulate.<br>
><br>
> - David<br>
><br>
> P.S. I'll be on the freeswitch-dev IRC channel if anyone would like to<br>
> discuss more details of the leak.<br>
><br>
> _________________________________________________________________________<br>
> Professional FreeSWITCH Consulting Services:<br>
> <a href="mailto:consulting@freeswitch.org">consulting@freeswitch.org</a><br>
> <a href="http://www.freeswitchsolutions.com" target="_blank">http://www.freeswitchsolutions.com</a><br>
><br>
> FreeSWITCH-powered IP PBX: The CudaTel Communication Server<br>
> <a href="http://www.cudatel.com" target="_blank">http://www.cudatel.com</a><br>
><br>
> Official FreeSWITCH Sites<br>
> <a href="http://www.freeswitch.org" target="_blank">http://www.freeswitch.org</a><br>
> <a href="http://wiki.freeswitch.org" target="_blank">http://wiki.freeswitch.org</a><br>
> <a href="http://www.cluecon.com" target="_blank">http://www.cluecon.com</a><br>
><br>
> FreeSWITCH-dev mailing list<br>
> <a href="mailto:FreeSWITCH-dev@lists.freeswitch.org">FreeSWITCH-dev@lists.freeswitch.org</a><br>
> <a href="http://lists.freeswitch.org/mailman/listinfo/freeswitch-dev" target="_blank">http://lists.freeswitch.org/mailman/listinfo/freeswitch-dev</a><br>
> UNSUBSCRIBE:<a href="http://lists.freeswitch.org/mailman/options/freeswitch-dev" target="_blank">http://lists.freeswitch.org/mailman/options/freeswitch-dev</a><br>
> <a href="http://www.freeswitch.org" target="_blank">http://www.freeswitch.org</a><br>
<br>
<br>
_________________________________________________________________________<br>
Professional FreeSWITCH Consulting Services:<br>
<a href="mailto:consulting@freeswitch.org">consulting@freeswitch.org</a><br>
<a href="http://www.freeswitchsolutions.com" target="_blank">http://www.freeswitchsolutions.com</a><br>
<br>
FreeSWITCH-powered IP PBX: The CudaTel Communication Server<br>
<a href="http://www.cudatel.com" target="_blank">http://www.cudatel.com</a><br>
<br>
Official FreeSWITCH Sites<br>
<a href="http://www.freeswitch.org" target="_blank">http://www.freeswitch.org</a><br>
<a href="http://wiki.freeswitch.org" target="_blank">http://wiki.freeswitch.org</a><br>
<a href="http://www.cluecon.com" target="_blank">http://www.cluecon.com</a><br>
<br>
FreeSWITCH-dev mailing list<br>
<a href="mailto:FreeSWITCH-dev@lists.freeswitch.org">FreeSWITCH-dev@lists.freeswitch.org</a><br>
<a href="http://lists.freeswitch.org/mailman/listinfo/freeswitch-dev" target="_blank">http://lists.freeswitch.org/mailman/listinfo/freeswitch-dev</a><br>
UNSUBSCRIBE:<a href="http://lists.freeswitch.org/mailman/options/freeswitch-dev" target="_blank">http://lists.freeswitch.org/mailman/options/freeswitch-dev</a><br>
<a href="http://www.freeswitch.org" target="_blank">http://www.freeswitch.org</a><br>
</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div dir="ltr">Anthony Minessale II<br><br><div>☞ <a href="http://www.freeswitch.org/">http://www.freeswitch.org/</a> ☞ <a href="http://www.cluecon.com/">http://www.cluecon.com/</a> ☞ <a href="http://twitter.com/FreeSWITCH">http://twitter.com/FreeSWITCH</a></div>
<div><div>☞ <a href="http://irc.freenode.net">irc.freenode.net</a> #freeswitch<br><br></div><div>ClueCon Weekly Development Call <br></div><div>☎ <a href="mailto:sip%3A888@conference.freeswitch.org">sip:888@conference.freeswitch.org</a> ☎ +19193869900 </div>
<div><br></div></div></div>
</div>