[Freeswitch-dev] Memory Leak in mod_sofia

Anthony Minessale anthony.minessale at gmail.com
Tue Dec 10 21:54:25 MSK 2013


See latest HEAD and patch in FS-6006

It seems that in some conditions, the code around state transitions that is
supposed to parse the signal data gets skipped.
The new one-line patch adds a parse_all_messages call in the heart of the
state machine, so it is executed every time the state machine sets a new
running state.

I was able to lab up a BYE glare scenario with SIPp where I could reproduce
the queued_events bug in one second flat with 200 simultaneous calls.
After this patch I was able to run the same test for 20,000 calls without
triggering any delta in queued_events.






On Mon, Dec 9, 2013 at 9:26 PM, David Sanders <dsanders at pinger.com> wrote:

> I've dug into this more after suggestions from Anthony and others in
> IRC, and reproduced against 1.2.stable.
>
> I think there are a couple of leak cases, but I've got one large one which
> I'll go into further.
>
> Our main use case is bridging audio, and most of the leaks reference leg
> B. After tracking down some more SIP for leaked calls, with timestamps,
> I can see that the leak appears to happen if events happen nearly
> simultaneously on leg A and leg B. One example would be both legs saying
> BYE (bye glare) at the same time. Another example I saw was a CANCEL
> coming in on leg A at the same time as a 200 OK on leg B.
>
> My current theory is that any events in the signal data queue for leg B
> will get leaked, and in turn leak references to the nua_handle_t. I
> can't find any code which 'pumps' the signal data queue on session
> destroy. To prevent the leak, I think this needs to happen. The events
> don't need to be processed since the session is being destroyed, but
> they should be pumped out of the signal data queue, free'd, and their
> reference to the handle should be released.
>
> Logically it makes sense that most of the leaked events I observed would
> be 100 BYE Trying, since simultaneous BYE is probably the most likely
> case for the race condition to occur in.
>
> As per Anthony's suggestion, I added trace to where the events are
> queued, and to where they are processed. Grepping the logs I gathered
> from a 24 hour test run, I can see discrepancies between queued and
> processed, just by event types.
>
> I've created the table below to illustrate the difference in events
> queued vs processed that I'm seeing. I'm only using one SIP profile, and
> I can see in memory that 'queued_events' on the struct = 87. The table
> adds up to -85, so there's likely one event type I didn't notice in my
> trace.
>
> +------------------+--------+----------+-----------+------+
> | event            | status | in count | out count | diff |
> +------------------+--------+----------+-----------+------+
> | nua_i_active     |  200   |    86717 |     86711 |   -6 |
> | nua_i_invite     |  100   |    77054 |     77054 |   -1 |
> | nua_i_terminated |  200   |    32588 |     32588 |    0 |
> | nua_i_terminated |  487   |    29156 |     29155 |   -1 |
> | nua_i_state      |    0   |    45443 |     45443 |    0 |
> | nua_i_state      |  180   |    73921 |     73917 |   -4 |
> | nua_i_state      |  200   |   206908 |    206898 |  -10 |
> | nua_i_state      |  487   |    29156 |     29155 |   -1 |
> | nua_r_invite     |  180   |     2411 |      2411 |    0 |
> | nua_r_invite     |  183   |    36423 |     36418 |   -5 |
> | nua_r_invite     |  200   |    32075 |     32071 |   -4 |
> | nua_r_invite     |  408   |     1543 |      1543 |    0 |
> | nua_r_invite     |  486   |     2479 |      2479 |    0 |
> | nua_r_cancel     |  200   |     9099 |      9099 |    0 |
> | nua_r_bye        |  200   |    32491 |     32491 |    0 |
> | nua_i_bye        |  100   |    52363 |     52310 |  -53 |
> +------------------+--------+----------+-----------+------+
>
> - David
>
>
> On 12/6/13 12:20 PM, David Sanders wrote:
> > I've been trying to track down and kill a memory leak which is occurring
> > in mod_sofia/Sofia-SIP. The leak was first spotted under 1.2.0 and
> > confirmed to still exist in 1.2.14 (latest when it was tested about a
> > month ago).
> >
> > Two JIRA bugs were submitted as part of this hunt, FS-6005 and FS-6006.
> > The issue improved on 1.2.0 after the patches for those two bugs, but
> > showed no difference on 1.2.14.
> >
> > The leak only seems to occur under heavy load (i.e. production traffic),
> > but will produce several leaks within 24 hours of a FS restart. Gory
> > details follow.
> >
> > Things known about the leak:
> >     * The leaked memory is a nua_handle_t struct, and the associated
> > structs which are free'd properly if the nua_handle_t doesn't leak
> >     * The box running FS is a CentOS 6.3 machine, Linux kernel version
> > 2.6.32-279
> >     * Leak can be analyzed on an idle FS server using GDB + a modified
> > version of gdb-heap (an extension for GDB for viewing the heap)
> >        * Once leaked memory is identified, it can be printed with GDB as
> > the nua_handle_t struct and data can be gleaned from it
> >
> > After the patches mentioned above, 1.2.0 has several handles leaked, all
> > with a sofia_dispatch_event_t struct in their memory pool, indicating a
> > 100 BYE Trying event. Since this event was never processed (and free'd)
> > it never decremented the ref count on the handle. The handle is marked
> > as destroyed, so the ref count is the only thing keeping it around.
> >
> > On 1.2.14, it acted the same prior to FS-6006's patch. After that patch
> > was applied, the leak mutated. The idle box lists no calls and no
> > channels, but the leaked handles are not marked as destroyed, have
> > nh_active_call=1, and have a ref count of 3.
> >
> > Prior to FS-6006 it was clear with ref trace enabled for mod_sofia that
> > the missing unref was due to the 100 BYE Trying never being processed.
> > I'll have to re-enable the ref trace to see why there are now 3
> > references left after FS-6006.
> >
> > Here's a Valgrind entry for the leaked memory (this is from a 1.2.14
> > instance):
> >
> > ==28083== 22,980 (1,120 direct, 21,860 indirect) bytes in 4 blocks are
> definitely lost in loss record 448 of 453
> > ==28083==    at 0x4C2677B: calloc (vg_replace_malloc.c:593)
> > ==28083==    by 0xBA60CC1: su_home_new (su_alloc.c:559)
> > ==28083==    by 0xBA0D8F5: nh_create_handle (nua_common.c:113)
> > ==28083==    by 0xBA2AA2A: nua_handle (nua.c:315)
> > ==28083==    by 0xB9AAD38: sofia_glue_do_invite (sofia_glue.c:2357)
> > ==28083==    by 0xB9729CC: sofia_on_init (mod_sofia.c:108)
> > ==28083==    by 0x5138702: switch_core_session_run
> (switch_core_state_machine.c:424)
> > ==28083==    by 0x5134D8D: switch_core_session_thread
> (switch_core_session.c:1417)
> > ==28083==    by 0x596C850: start_thread (in /lib64/libpthread-2.12.so)
> > ==28083==    by 0x6AEB11C: clone (in /lib64/libc-2.12.so)
> >
> >
> > I've exhausted my ability to debug the issue, and could use some help
> > from the FS maintainers/community in killing the issue. The leaked
> > memory, while small, appears to cause rather large performance problems
> > for the FS instance if allowed to accumulate.
> >
> > - David
> >
> > P.S. I'll be on the freeswitch-dev IRC channel if anyone would like to
> > discuss more details of the leak.
> >
> > _________________________________________________________________________
> > Professional FreeSWITCH Consulting Services:
> > consulting at freeswitch.org
> > http://www.freeswitchsolutions.com
> >
> > 
> > 
> >
> > Official FreeSWITCH Sites
> > http://www.freeswitch.org
> > http://wiki.freeswitch.org
> > http://www.cluecon.com
> >
> > FreeSWITCH-dev mailing list
> > FreeSWITCH-dev at lists.freeswitch.org
> > http://lists.freeswitch.org/mailman/listinfo/freeswitch-dev
> > UNSUBSCRIBE:http://lists.freeswitch.org/mailman/options/freeswitch-dev
> > http://www.freeswitch.org
>
>



-- 
Anthony Minessale II

☞ http://www.freeswitch.org/  http://www.cluecon.com/  http://twitter.com/FreeSWITCH
☞ irc.freenode.net #freeswitch

ClueCon Weekly Development Call
☎ sip:888 at conference.freeswitch.org  ☎ +19193869900

