[Freeswitch-dev] Memory Leak in mod_sofia

David Sanders dsanders at pinger.com
Tue Dec 10 06:26:49 MSK 2013


I've dug into this more after suggestions from Anthony and others in 
IRC, and have reproduced it against 1.2.stable.

I think there are a couple of leak cases, but there is one large one 
that I'll go into further here.

Our main use case is bridging audio, and most of the leaks reference leg 
B. After tracking down the SIP traffic for leaked calls, with timestamps, 
I can see that the leak appears to happen when events occur nearly 
simultaneously on leg A and leg B. One example would be both legs 
sending BYE at the same time (BYE glare). Another example I saw was a 
CANCEL coming in on leg A at the same time as a 200 OK on leg B.

My current theory is that any events still sitting in the signal data 
queue for leg B get leaked, and in turn leak references to the 
nua_handle_t. I can't find any code which 'pumps' the signal data queue 
on session destroy. To prevent the leak, I think this needs to happen: 
the events don't need to be processed, since the session is being 
destroyed, but they should be pumped out of the signal data queue, 
freed, and their reference to the handle released.
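
To make the idea concrete, here is a rough, untested sketch of the kind 
of drain I have in mind, run at session destroy. It's only illustrative: 
switch_core_session_dequeue_signal_data(), nua_destroy_event(), 
nua_handle_home() and nua_handle_unref() are the existing core/Sofia-SIP 
calls I'd expect to use, but the sofia_dispatch_event_t field names, the 
assumption that the event was allocated from the handle's home, and the 
right place to hook this in (sofia_on_destroy()?) are all guesses on my 
part.

/* Illustrative only -- assumes the mod_sofia headers are in scope and that
 * the session's signal data queue holds sofia_dispatch_event_t pointers,
 * each carrying a saved nua event plus a reference on its nua_handle_t. */
static void drain_signal_data_queue(switch_core_session_t *session)
{
	void *pop = NULL;

	/* Pump everything left in the session's signal data queue. The events
	 * are not dispatched -- the session is going away -- they are only
	 * released so the nua_handle_t ref count can drop. */
	while (switch_core_session_dequeue_signal_data(session, &pop) == SWITCH_STATUS_SUCCESS) {
		sofia_dispatch_event_t *de = (sofia_dispatch_event_t *) pop;
		nua_handle_t *nh = de->nh;

		nua_destroy_event(de->event);      /* drop the saved nua event */
		su_free(nua_handle_home(nh), de);  /* assumes de came from nh's home */
		nua_handle_unref(nh);              /* release the queued reference */

		/* The profile's queued_events counter would presumably need to be
		 * decremented here as well. */
	}
}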

Logically it makes sense that most of the leaked events I observed would 
be 100 BYE Trying, since simultaneous BYE is probably the most likely 
case for the race condition to occur in.

As per Anthony's suggestion, I added a trace where the events are 
queued and where they are processed. Grepping the logs gathered from a 
24-hour test run, I can see discrepancies between the queued and 
processed counts, broken down by event type.
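
For reference, the trace is nothing fancier than a log line at the 
enqueue point and another at the dispatch point, keyed on event name and 
status so the two sides can be grepped and diffed. Roughly along these 
lines (a sketch, not the literal code I added; the field names on 
sofia_dispatch_event_t are from memory):

/* Sketch of the queued/processed trace. Call with tag = "QUEUE" where the
 * sofia_dispatch_event_t is put on the signal data queue, and with
 * tag = "PUMP" where it is actually processed, then grep the log for
 * QUEUE_TRACE / PUMP_TRACE and compare the per-event/status counts. */
static void trace_dispatch_event(const char *tag, sofia_dispatch_event_t *de)
{
	switch_log_printf(SWITCH_CHANNEL_LOG, SWITCH_LOG_DEBUG,
			  "%s_TRACE event=%s status=%d\n",
			  tag, nua_event_name(de->data->e_event), de->data->e_status);
}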

I've created the table below to illustrate the difference in events 
queued vs. processed that I'm seeing. I'm only using one SIP profile, and 
I can see in memory that 'queued_events' on the profile struct is 87. The 
table only adds up to -85, so there's likely one event type I didn't 
notice in my trace.

+------------------+--------+----------+-----------+------+
| event            | status | in count | out count | diff |
+------------------+--------+----------+-----------+------+
| nua_i_active     |   200  |    86717 |     86711 |   -6 |
| nua_i_invite     |   100  |    77054 |     77054 |   -1 |
| nua_i_terminated |   200  |    32588 |     32588 |    0 |
| nua_i_terminated |   487  |    29156 |     29155 |   -1 |
| nua_i_state      |     0  |    45443 |     45443 |    0 |
| nua_i_state      |   180  |    73921 |     73917 |   -4 |
| nua_i_state      |   200  |   206908 |    206898 |  -10 |
| nua_i_state      |   487  |    29156 |     29155 |   -1 |
| nua_r_invite     |   180  |     2411 |      2411 |    0 |
| nua_r_invite     |   183  |    36423 |     36418 |   -5 |
| nua_r_invite     |   200  |    32075 |     32071 |   -4 |
| nua_r_invite     |   408  |     1543 |      1543 |    0 |
| nua_r_invite     |   486  |     2479 |      2479 |    0 |
| nua_r_cancel     |   200  |     9099 |      9099 |    0 |
| nua_r_bye        |   200  |    32491 |     32491 |    0 |
| nua_i_bye        |   100  |    52363 |     52310 |  -53 |
+------------------+--------+----------+-----------+------+

- David


On 12/6/13 12:20 PM, David Sanders wrote:
> I've been trying to track down and kill a memory leak occurring in
> mod_sofia/Sofia-SIP. The leak was first spotted under 1.2.0 and
> confirmed to still exist in 1.2.14 (the latest release when it was
> tested about a month ago).
>
> Two JIRA bugs were submitted as part of this hunt, FS-6005 and FS-6006.
> The issue improved on 1.2.0 after the patches for those two bugs, but
> showed no difference on 1.2.14.
>
> The leak only seems to occur under heavy load (i.e. production traffic),
> but it will produce several leaked handles within 24 hours of an FS
> restart. Gory details follow.
>
> Things known about the leak:
>     * The leaked memory is a nua_handle_t struct, plus the associated
> structs that would have been freed properly if the nua_handle_t had not leaked
>     * The box running FS is a CentOS 6.3 machine, Linux kernel version
> 2.6.32-279
>     * The leak can be analyzed on an idle FS server using GDB plus a
> modified version of gdb-heap (a GDB extension for viewing the heap)
>        * Once the leaked memory is identified, it can be printed with GDB
> as a nua_handle_t struct and data can be gleaned from it
>
> After the patches mentioned above, 1.2.0 still leaks several handles, all
> with a sofia_dispatch_event_t struct in their memory pool indicating a
> 100 BYE Trying event. Since this event was never processed (and freed),
> it never decremented the ref count on the handle. The handle is marked
> as destroyed, so the ref count is the only thing keeping it around.
>
> On 1.2.14, it acted the same prior to FS-6006's patch. After that patch
> was applied, the leak mutated. The idle box lists no calls and no
> channels, but the leaked handles are not marked as destroyed, have
> nh_active_call=1, and have a ref count of 3.
>
> Prior to FS-6006 it was clear with ref trace enabled for mod_sofia that
> the missing unref was due to the 100 BYE Trying never being processed.
> I'll have to re-enable the ref trace to see why there are now 3
> references left after FS-6006.
>
> Here's a Valgrind entry for the leaked memory (this is from a 1.2.14
> instance):
>
> ==28083== 22,980 (1,120 direct, 21,860 indirect) bytes in 4 blocks are definitely lost in loss record 448 of 453
> ==28083==    at 0x4C2677B: calloc (vg_replace_malloc.c:593)
> ==28083==    by 0xBA60CC1: su_home_new (su_alloc.c:559)
> ==28083==    by 0xBA0D8F5: nh_create_handle (nua_common.c:113)
> ==28083==    by 0xBA2AA2A: nua_handle (nua.c:315)
> ==28083==    by 0xB9AAD38: sofia_glue_do_invite (sofia_glue.c:2357)
> ==28083==    by 0xB9729CC: sofia_on_init (mod_sofia.c:108)
> ==28083==    by 0x5138702: switch_core_session_run (switch_core_state_machine.c:424)
> ==28083==    by 0x5134D8D: switch_core_session_thread (switch_core_session.c:1417)
> ==28083==    by 0x596C850: start_thread (in /lib64/libpthread-2.12.so)
> ==28083==    by 0x6AEB11C: clone (in /lib64/libc-2.12.so)
>
>
> I've exhausted my ability to debug this, and could use some help from
> the FS maintainers/community in killing it. The leaked memory, while
> small in size, appears to cause rather large performance problems for
> the FS instance if allowed to accumulate.
>
> - David
>
> P.S. I'll be on the freeswitch-dev IRC channel if anyone would like to
> discuss more details of the leak.