[Freeswitch-users] Hung Channels (SVN Rev 10231)

Eric Liedtke e at musinghalfwit.org
Thu Mar 5 14:38:42 PST 2009


Greetings, 

I've been using FS in production on this rev (I realize it's pretty far
behind current) and it's been running well, save one issue.

The basic setup is an SBC with two GigE ports, one public and one private.
I have two SIP profiles, one per IP interface. The box is used to terminate
traffic to a provider, so calls flow in only one direction: they come in on
the private-side profile, get routed via the dialplan to the gateway defined
in the external profile, and go on to the vendor. Pretty simple.

I have noticed that under load (50 or so CPS with ~800-900 bridged calls up),
some channels on the public side seem to get "stuck" over time. Given how
this box is used, I would expect both SIP profiles to show the same number
of channels in use (or be within a channel or two of each other) any time I
run 'sofia status'. However, after a day of heavy use I had a disparity of
~250 channels. These extra channels also seem to put continual load on the
system CPU, as reported by top.

Of course, due to the load on the box I have to keep logging turned way
down, so I've been trying to troubleshoot this as best I can.

Last night I grabbed a core file, and I started in with GDB today. I found
the 120 or so threads that represented real active calls when I took the
core file, and I also found ~250 threads that appeared to be stuck in the
CS_NEW state. The backtraces on all of them look the same, as shown below.
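
For reference, frames #3 and #4 suggest the session thread is polling
rather than blocking on a condition variable. Here is a rough paraphrase of
the loop shape those frames imply (names simplified and hypothetical, not
the actual switch_core_state_machine.c source):

#include <switch.h>

/* Paraphrased sketch of what frames #3-#4 imply; simplified guesswork,
 * not the real switch_core_session_run(). */
static void session_run_sketch(switch_core_session_t *session)
{
    switch_channel_t *channel = switch_core_session_get_channel(session);
    switch_channel_state_t state, midstate = CS_DESTROY;

    while ((state = switch_channel_get_state(channel)) != CS_DESTROY) {
        if (state != midstate) {
            midstate = state;
            /* run the handler for the newly entered state
             * (CS_NEW, CS_INIT, ...) */
        } else {
            /* no state change yet: poll again in 1ms; the argument is
             * in microseconds, matching switch_sleep (t=1000) in frame #3 */
            switch_sleep(1000);
        }
    }
}

If that's roughly the shape of it, ~250 threads each waking every
millisecond would also account for the continual system CPU load I'm
seeing in top.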

I walked through the code path by hand, based on the backtraces, and I
don't see how this could be happening unless it's a locking issue. But as
far as I can tell each session has its own mutex defined in the
switch_core_session_t struct, so I wouldn't think they would be stepping
on each other. I also would have expected that if it were something of a
deadlock nature, it would stop processing calls altogether.
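
Roughly the locking pattern I have in mind (a hypothetical sketch; the
struct and function names here are illustrative, not lifted from the real
code):

#include <switch.h>

/* Hypothetical sketch of one-mutex-per-session locking; illustrative
 * names, not the actual switch_core_session_t definition. */
struct session_sketch {
    switch_mutex_t *mutex;       /* private to this session, never shared */
    switch_channel_t *channel;
    /* ... */
};

static void set_state_locked(struct session_sketch *s,
                             switch_channel_state_t state)
{
    switch_mutex_lock(s->mutex);   /* only this session's lock is taken */
    switch_channel_set_state(s->channel, state);
    switch_mutex_unlock(s->mutex);
}

Since each thread would only ever take its own session's mutex, I don't
see an ordering that would let two sessions deadlock each other.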

I grabbed the commands from the .gdbinit (super handy, btw!) and have been
trawling through the variables to try to ascertain why these threads seem
to be stuck, but I'm not having much luck even coming up with a scenario
to try to replicate the issue.

If anyone has any pointers as to where I might look next, it would be
greatly appreciated.

We will be updating to the newest release soon; however, I was hoping to
nail down what is going on so I can systematically replicate it and verify
by testing in the lab that it is fixed, rather than just pushing the new
release to production and hoping.

Thanks in advance for any tips/pointers anyone may have.

-e

...... 'bt' and 'bt full' for a single "hung" thread:


#0  0xb7fd5410 in __kernel_vsyscall ()
#1  0xb7d14cb6 in nanosleep () from /lib/tls/i686/cmov/libc.so.6
#2  0xb7d4f1dc in usleep () from /lib/tls/i686/cmov/libc.so.6
#3  0xb7ee02cd in switch_sleep (t=1000) at src/switch_time.c:143
#4  0xb7e9da03 in switch_core_session_run (session=0x95fe270) at src/switch_core_state_machine.c:462
#5  0xb7e9c765 in switch_core_session_thread (thread=0x9ada840, obj=0x95fe270) at src/switch_core_session.c:853
#6  0xb7efd916 in dummy_worker (opaque=0x9ada840) at threadproc/unix/thread.c:138
#7  0xb7e034fb in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
#8  0xb7d55e5e in clone () from /lib/tls/i686/cmov/libc.so.6
(gdb) bt full
#0  0xb7fd5410 in __kernel_vsyscall ()
No symbol table info available.
#1  0xb7d14cb6 in nanosleep () from /lib/tls/i686/cmov/libc.so.6
No symbol table info available.
#2  0xb7d4f1dc in usleep () from /lib/tls/i686/cmov/libc.so.6
No symbol table info available.
#3  0xb7ee02cd in switch_sleep (t=1000) at src/switch_time.c:143
No locals.
#4  0xb7e9da03 in switch_core_session_run (session=0x95fe270) at src/switch_core_state_machine.c:462
        exception = 0 '\0'
        state = <value optimized out>
        endstate = CS_NEW
        endpoint_interface = <value optimized out>
        driver_state_handler = (const switch_state_handler_table_t *) 0xb73b1720
        application_state_handler = <value optimized out>
        thread_id = 3085554955
        env = {{__jmpbuf = {134603552, -1428248680, -1461722504, 9184, -1210273432, -1210014020}, __mask_was_saved = -1210034895, __saved_mask = {__val = {0, 3084988404, 3084937740, 3086469280, 9184, 1, 2976641592, 2833244792, 3086590960, 
        168036728, 3084937740, 2833244808, 3085923728, 1, 3086590960, 2833244840, 3086590960, 0, 134564192, 2833244840, 3085923728, 134564244, 3086590960, 2833244872, 3085887870, 134564240, 168036728, 3085458203, 3086590960, 2976606624, 
        134564192, 2833244904}}}}
        sig = <value optimized out>
        __func__ = "switch_core_session_run"
        __PRETTY_FUNCTION__ = "switch_core_session_run"
#5  0xb7e9c765 in switch_core_session_thread (thread=0x9ada840, obj=0x95fe270) at src/switch_core_session.c:853
        session = (switch_core_session_t *) 0x95fe270
        event = <value optimized out>
        event_str = 0x0
        val = <value optimized out>
        __func__ = "switch_core_session_thread"
        __PRETTY_FUNCTION__ = "switch_core_session_thread"
#6  0xb7efd916 in dummy_worker (opaque=0x9ada840) at threadproc/unix/thread.c:138
No locals.
#7  0xb7e034fb in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
No symbol table info available.
#8  0xb7d55e5e in clone () from /lib/tls/i686/cmov/libc.so.6
