[Freeswitch-users] Capacity testing, seg fault

tuhl at ix.netcom.com tuhl at ix.netcom.com
Thu Nov 29 12:53:40 PST 2007


I'm running some capacity tests on FreeSWITCH and can cause 
segfaults fairly quickly (in under a minute) at a 'light' load of 10 call 
originations per second. The core dump backtrace is at the bottom, and my 
debugging shows what looks like a corrupted js_session. I'll open an 
issue on JIRA. In the meantime, I wanted to get opinions on whether this 
is a valid architecture for capacity testing, and whether I'm making a 
simple mistake.

I have the trunk version installed on two servers and have one server 
(the originator) calling the other (the receiver) over SIP with G.711, 
and a Gig-E Ethernet switch between them. The originating server 
basically does a session.originate, waitForAnswer, streamFile 
(a 10-second 8 kHz .wav), and hangup. The receiver does a session.answer 
and a recordFile to a .wav file so I can go back and check voice quality.
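To make the originator's behavior concrete, here is a rough sketch of what such a script looks like in FreeSWITCH's SpiderMonkey JavaScript. The dial string, sound-file path, and timeout are placeholders I made up for illustration, not the actual values from my play.js:

```javascript
// play.js - hypothetical sketch of the originator-side script.
// Dial string and .wav path below are illustrative placeholders.
var s = new Session();
if (s.originate(undefined, "sofia/default/1000@receiver.example.com")) {
    s.waitForAnswer(5000);      // wait up to 5 seconds for the far end
    if (s.answered) {
        // stream the 10-second 8 kHz test prompt to the far end
        s.streamFile("/usr/local/freeswitch/sounds/test-10s-8khz.wav");
    }
    s.hangup();
}
```

The real script is driven per-call via 'jsrun' from my test engine, so the destination number is passed in rather than hard-coded as above.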

My capacity-testing engine is a Perl script that uses the XML-RPC 
interface to originate the calls on FreeSWITCH (I submit requests 
to do a 'jsrun play.js' to a certain phone number, where play.js is a 
simple script that originates, waits for answer, streams the file, and 
hangs up). I can configure it to make a certain number of originations 
per second and a certain total number of calls. I have no Perl script 
running on the receiver; I just set up the dialplan to call a .js file 
which answers the call and records it. This testing setup is at an 
early stage, so right now, to check pass/fail, I just verify that if 
I ran a 1000-call test on the originator, there are 1000 .wav 
files on the receiver at the end of the test, all about the same size, 
and no crashes.
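The receiver-side script is equally simple; a sketch of what it does (the recording path and the use of the channel UUID for unique filenames are my illustrative choices, not necessarily what my actual script uses):

```javascript
// record.js - hypothetical sketch of the receiver-side script,
// invoked from the dialplan for each incoming call.
session.answer();
// Record the incoming audio to a per-call .wav so voice quality
// and the expected file count/size can be checked after the test.
session.recordFile("/tmp/capacity-" + session.uuid + ".wav");
session.hangup();
```

Since one .wav is written per answered call, counting files of roughly equal size after the run is what gives me the pass/fail check described above.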

I've compiled with debug flags on and set all *_DEBUG flags to 
9 (I have also rerun the tests after a recompile with debug flags 
off/0, and that made no difference). I've run all the ulimit 
commands that were in the last few emails on this list. I'm running 
on FC6 on a Dell 2850 with dual 3.6 GHz Xeons (/proc/cpuinfo shows 4 
processors) and 4 GB of RAM. I've set max-sessions to 3000 and the 
session rate to 100. 'top' shows freeswitch at 60-80% on the receiver 
during this test.

My problem right now is on the receiver. I wouldn't normally care 
about it, because I'm most interested in the origination capacity of 
FreeSWITCH, but with the receiver crashing so quickly I can't push 
the originator very hard. I set up my originating engine to make 1000 
total calls at 10 call originations per second, each call lasting 10 
seconds (which results in about 120 simultaneous channels in use), 
and I get a segfault on the receiver within about 500 calls, or 50 
seconds. If I run a test with 1000 total calls at 6 call originations 
per second, it will work, but if I run an overnight test with 20,000 
total calls at 6 originations per second, the receiver will 
sometimes segfault at around 15,000 calls, and sometimes it will 
not. Interestingly, a very short but very high-rate test of 100 total 
calls at 50 originations per second will usually work, but 500 total 
calls at 50 per second will always segfault.

Just so you know... I'm shooting for the holy grail of stable 
operation at 100 call originations per second. I know people have 
reported much better results than I'm getting. Is something wrong with my setup?

Here's the core dump backtrace. I added some debug printfs in 
session_destroy right before the call to destroy_speech_engine, and 
it looks like the jss has been trampled: for example, jss->flags is 
always 0 for all my successful calls, but right before the segfault, 
jss->flags is some large random number. This happens every single time.

Program terminated with signal 11, Segmentation fault.
#0  0x00000000 in ?? ()
(gdb) bt
#0  0x00000000 in ?? ()
#1  0x40040437 in switch_core_codec_destroy (codec=0x54ece168) at 
#2  0x40ee778b in destroy_speech_engine (jss=0x51206538) at 
#3  0x40eeaa70 in session_destroy (cx=0x549c9920, obj=0x4eeac7f0) at 
#4  0x417d1aa7 in js_FinalizeObject (cx=0x549c9920, obj=0x4eeac7f0) 
at src/jsobj.c:2168
#5  0x417b04d9 in js_GC (cx=0x549c9920, gcflags=0) at src/jsgc.c:1856
#6  0x417af6ad in js_ForceGC (cx=0x549c9920, gcflags=0) at src/jsgc.c:1508
#7  0x417830fd in js_DestroyContext (cx=0x549c9920, 
gcmode=JS_FORCE_GC) at src/jscntxt.c:285
#8  0x417727ac in JS_DestroyContext (cx=0x549c9920) at src/jsapi.c:956
#9  0x40eec3c9 in js_parse_and_execute (session=0x464d9678, 
input_code=0x9e31458 "capacity.js", ro=0x0) at mod_spidermonkey.c:3296
#10 0x40eec3f2 in js_dp_function (session=0x464d9678, data=0x9e31458 
"capacity.js") at mod_spidermonkey.c:3302
#11 0x40044341 in switch_core_session_exec (session=0x464d9678, 
application_interface=0x40f26f80, arg=0x9e31458 "capacity.js")
     at src/switch_core_session.c:936
#12 0x400455be in switch_core_standard_on_execute 
(session=0x464d9678) at src/switch_core_state_machine.c:169
#13 0x40046605 in switch_core_session_run (session=0x464d9678) at 
#14 0x4004381c in switch_core_session_thread (thread=0x9e31288, 
obj=0x464d9678) at src/switch_core_session.c:681
#15 0x4009047c in dummy_worker (opaque=0x9e31288) at 
#16 0x007cb3db in start_thread () from /lib/libpthread.so.0
#17 0x0072506e in clone () from /lib/libc.so.6


tuhl at ix.netcom.com 
