[Freeswitch-dev] Memory Leak in mod_sofia

David Sanders dsanders at pinger.com
Tue Dec 17 03:17:22 MSK 2013


Follow up on this:

The patch Anthony provided in FS-6006 looks good. After almost a week of 
running queued_events is still 0, and memory usage on the machine looks 
much better. More importantly, audio quality isn't degrading like we 
were seeing when the memory leak existed.

There's still another memory leak regarding Sofia-SIP that seems to 
occur if a 183, 200, 183 sequence occurs on an outbound call, but I 
haven't fully quantified or reproduced it yet. It is far more rare, but 
once I corner it I'll provide either a patch or bug report.

- David


On 12/10/13 2:15 PM, David Sanders wrote:
> Thanks for the quick turn around. Currently trying it out to see how 
> things look after 24 hours, and will report back when that finishes 
> and is analyzed.
>
> Glad to hear you were able to reproduce in a lab setup, that was going 
> to be my next step after documenting the issue, but looks like you 
> beat me to it.
>
> - David
>
>
> On 12/10/13 10:54 AM, Anthony Minessale wrote:
>> See latest HEAD and patch in FS-6006
>>
>> It seems in some conditions the code around state transitions that is 
>> supposed to parse the signal data makes it past there.
>> The new 1 line patch adds a parse_all_messages in the heart of the 
>> state machine where it is executed every time the state machine sets 
>> a new running state.
>>
>> I was able to lab up a bye glare scenario with sipp where I could 
>> reproduce the queued_events bug in 1 second flat with 200 simo calls.
>> After this patch I was able to run the same test for 20000 calls 
>> without triggering any delta in the queued_events
>>
>>
>>
>>
>>
>>
>> On Mon, Dec 9, 2013 at 9:26 PM, David Sanders <dsanders at pinger.com 
>> <mailto:dsanders at pinger.com>> wrote:
>>
>>     I've dug into this more after suggestions from Anthony and others in
>>     IRC, and reproduced against 1.2.stable.
>>
>>     I think there's a couple of leak cases, but I've got one large
>>     one which
>>     I'll go into further.
>>
>>     Our main use case is bridging audio, and most of the leaks
>>     reference leg
>>     B. After tracking down some more SIP for leaked calls, with
>>     timestamps,
>>     I can see that the leak appears to happen if events happen nearly
>>     simultaneously on leg A and leg B. One example would be both legs
>>     saying
>>     BYE (bye glare) at the same time. Another example I saw was a CANCEL
>>     coming in on leg A at the same time as a 200 OK on leg B.
>>
>>     My current theory is that any events in the signal data queue for
>>     leg B
>>     will get leaked, and in turn leak references to the nua_handle_t. I
>>     can't find any code which 'pumps' the signal data queue on session
>>     destroy. To prevent the leak, I think this needs to happen. The
>>     events
>>     don't need to be processed since the session is being destroyed, but
>>     they should be pumped out of the signal data queue, free'd, and their
>>     reference to the handle should be released.
>>
>>     Logically it makes sense that most of the leaked events I
>>     observed would
>>     be 100 BYE Trying, since simultaneous BYE is probably the most likely
>>     case for the race condition to occur in.
>>
>>     As per Anthony's suggestion, I added trace to where the events are
>>     queued, and to where they are processed. Grepping the logs I gathered
>>     from a 24 hour test run, I can see discrepancies between queued and
>>     processed, just by event types.
>>
>>     I've created the table below to illustrate the difference in events
>>     queued vs processed that I'm seeing. I'm only using one SIP
>>     profile, and
>>     I can see in memory that 'queued_events' on the struct = 87. The
>>     table
>>     adds up to -85, so there's likely one event type I didn't notice
>>     in my
>>     trace.
>>
>>     +------------------+--------+----------+-----------+------+
>>     | event            | status | in count | out count | diff |
>>     +------------------+--------+----------+-----------+------+
>>     | nua_i_active     |  200   | 86717    | 86711     |  -6 |
>>     | nua_i_invite     |  100   | 77054    | 77054     |  -1  |
>>     | nua_i_terminated |  200   | 32588    | 32588     |   0  |
>>     | nua_i_terminated |  487   | 29156    | 29155     |  -1 |
>>     | nua_i_state      |   0    | 45443    | 45443     |   0 |
>>     | nua_i_state      |  180   | 73921    | 73917     |  -4 |
>>     | nua_i_state      |  200   | 206908   | 206898    | -10  |
>>     | nua_i_state      |  487   | 29156    | 29155     |  -1  |
>>     | nua_r_invite     |  180   | 2411     | 2411      |   0  |
>>     | nua_r_invite     |  183   | 36423    | 36418     |  -5  |
>>     | nua_r_invite     |  200   | 32075    | 32071     |  -4 |
>>     | nua_r_invite     |  408   | 1543     | 1543      |   0 |
>>     | nua_r_invite     |  486   | 2479     | 2479      |   0 |
>>     | nua_r_cancel     |  200   | 9099     | 9099      |   0 |
>>     | nua_r_bye        |  200   | 32491    | 32491     |   0 |
>>     | nua_i_bye        |  100   | 52363    | 52310     | -53|
>>     +------------------+--------+----------+-----------+------+
>>
>>     - David
>>
>>
>>     On 12/6/13 12:20 PM, David Sanders wrote:
>>     > I've been trying to track down and kill a memory leak which is
>>     occurring
>>     > in mod_sofia/Sofia-SIP. The leak was first spotted under 1.2.0 and
>>     > confirmed to still exist in 1.2.14 (latest when it was tested
>>     about a
>>     > month ago).
>>     >
>>     > Two JIRA bugs were submitted as part of this hunt, FS-6005 and
>>     FS-6006.
>>     > The issue improved on 1.2.0 after the patches for those two
>>     bugs, but
>>     > showed no difference on 1.2.14.
>>     >
>>     > The leak only seems to occur under heavy load (i.e. production
>>     traffic),
>>     > but will produce several leaks within 24 hours of a FS restart.
>>     Gory
>>     > details follow.
>>     >
>>     > Things known about the leak:
>>     >     * The leaked memory is a nua_handle_t struct, and the
>>     associated
>>     > structs which are free'd properly if the nua_handle_t doesn't leak
>>     >     * The box running FS is a CentOS 6.3 machine, Linux kernel
>>     version
>>     > 2.6.32-279
>>     >     * Leak can be analyzed on an idle FS server using GDB + a
>>     modified
>>     > version of gdb-heap (an extension for GDB for viewing the heap)
>>     >        * Once leaked memory is identified, it can be printed
>>     with GDB as
>>     > the nua_handle_t struct and data can be gleaned from it
>>     >
>>     > After the patches mentioned above, 1.2.0 has several handles
>>     leaked, all
>>     > with a sofia_dispatch_event_t struct in their memory pool,
>>     indicating a
>>     > 100 BYE Trying event. Since this event was never processed (and
>>     free'd)
>>     > it never decremented the ref count on the handle. The handle is
>>     marked
>>     > as destroyed, so the ref count is the only thing keeping it around.
>>     >
>>     > On 1.2.14, it acted the same prior to FS-6006's patch. After
>>     that patch
>>     > was applied, the leak mutated. The idle box lists no calls and no
>>     > channels, but the leaked handles are not marked as destroyed, have
>>     > nh_active_call=1, and have a ref count of 3.
>>     >
>>     > Prior to FS-6006 it was clear with ref trace enabled for
>>     mod_sofia that
>>     > the missing unref was due to the 100 BYE Trying never being
>>     processed.
>>     > I'll have to re-enable the ref trace to see why there are now 3
>>     > references left after FS-6006.
>>     >
>>     > Here's a Valgrind entry for the leaked memory (this is from a
>>     1.2.14
>>     > instance):
>>     >
>>     > ==28083== 22,980 (1,120 direct, 21,860 indirect) bytes in 4
>>     blocks are definitely lost in loss record 448 of 453
>>     > ==28083==    at 0x4C2677B: calloc (vg_replace_malloc.c:593)
>>     > ==28083==    by 0xBA60CC1: su_home_new (su_alloc.c:559)
>>     > ==28083==    by 0xBA0D8F5: nh_create_handle (nua_common.c:113)
>>     > ==28083==    by 0xBA2AA2A: nua_handle (nua.c:315)
>>     > ==28083==    by 0xB9AAD38: sofia_glue_do_invite (sofia_glue.c:2357)
>>     > ==28083==    by 0xB9729CC: sofia_on_init (mod_sofia.c:108)
>>     > ==28083==    by 0x5138702: switch_core_session_run
>>     (switch_core_state_machine.c:424)
>>     > ==28083==    by 0x5134D8D: switch_core_session_thread
>>     (switch_core_session.c:1417)
>>     > ==28083==    by 0x596C850: start_thread (in
>>     /lib64/libpthread-2.12.so <http://libpthread-2.12.so>)
>>     > ==28083==    by 0x6AEB11C: clone (in /lib64/libc-2.12.so
>>     <http://libc-2.12.so>)
>>     >
>>     >
>>     > I've exhausted my ability to debug the issue, and could use
>>     some help
>>     > from the FS maintainers/community in killing the issue. The leaked
>>     > memory, while small, appears to cause rather large performance
>>     problems
>>     > for the FS instance if allowed to accumulate.
>>     >
>>     > - David
>>     >
>>     > P.S. I'll be on the freeswitch-dev IRC channel if anyone would
>>     like to
>>     > discuss more details of the leak.
>>     >
>>     >
>>     _________________________________________________________________________
>>     > Professional FreeSWITCH Consulting Services:
>>     > consulting at freeswitch.org <mailto:consulting at freeswitch.org>
>>     > http://www.freeswitchsolutions.com
>>     >
>>     > 
>>     > 
>>     >
>>     > Official FreeSWITCH Sites
>>     > http://www.freeswitch.org
>>     > http://wiki.freeswitch.org
>>     > http://www.cluecon.com
>>     >
>>     > FreeSWITCH-dev mailing list
>>     > FreeSWITCH-dev at lists.freeswitch.org
>>     <mailto:FreeSWITCH-dev at lists.freeswitch.org>
>>     > http://lists.freeswitch.org/mailman/listinfo/freeswitch-dev
>>     >
>>     UNSUBSCRIBE:http://lists.freeswitch.org/mailman/options/freeswitch-dev
>>     > http://www.freeswitch.org
>>
>>
>>     _________________________________________________________________________
>>     Professional FreeSWITCH Consulting Services:
>>     consulting at freeswitch.org <mailto:consulting at freeswitch.org>
>>     http://www.freeswitchsolutions.com
>>
>>     
>>     
>>
>>     Official FreeSWITCH Sites
>>     http://www.freeswitch.org
>>     http://wiki.freeswitch.org
>>     http://www.cluecon.com
>>
>>     FreeSWITCH-dev mailing list
>>     FreeSWITCH-dev at lists.freeswitch.org
>>     <mailto:FreeSWITCH-dev at lists.freeswitch.org>
>>     http://lists.freeswitch.org/mailman/listinfo/freeswitch-dev
>>     UNSUBSCRIBE:http://lists.freeswitch.org/mailman/options/freeswitch-dev
>>     http://www.freeswitch.org
>>
>>
>>
>>
>> -- 
>> Anthony Minessale II
>>
>>http://www.freeswitch.org/http://www.cluecon.com/>> http://twitter.com/FreeSWITCH
>> ☞ irc.freenode.net <http://irc.freenode.net> #freeswitch
>>
>> ClueCon Weekly Development Call
>> ☎ sip:888 at conference.freeswitch.org 
>> <mailto:sip%3A888 at conference.freeswitch.org>  ☎ +19193869900
>>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.freeswitch.org/pipermail/freeswitch-dev/attachments/20131216/27463ea3/attachment-0001.html 


Join us at ClueCon 2013 Aug 6-8, 2013
More information about the FreeSWITCH-dev mailing list