[Freeswitch-users] Capacity testing, seg fault

Thu Nov 29 13:32:15 PST 2007

Tom,

This sounds very interesting.  I'd like to know a few things:

First, if you disable the call recording on the receiving end, will you
still get the segfaults consistently?  Just curious to see what the
receiver does if it only answers the calls and then hangs up without the
extra burden of recording the audio streams.

Second, how hard would it be to have the originator and the receiver
trade places?  If you could, I'd like to see the two machines switch
roles, so that the machine currently acting as the receiver makes the
calls and the machine acting as the originator will now receive calls.
I'm wondering what will happen - will the segfaults stay at the same
machine or will they go over to the new receiver, or will they go away
altogether...?  

I know those are kinda brute force suggestions but they might yield some
interesting information:

If the segfaults occur only on one machine, regardless of whether it's
making or receiving calls, then obviously there's something up with that
machine.

If the segfaults always occur at the machine receiving the calls then,
of course, we've got a more interesting issue.

Would you mind putting your setup info, scripts, etc. into the
pastebin?  Maybe others could try to replicate your symptoms and see
what shakes out.

Thanks for taking the initiative to do this kind of testing.  It will
definitely help FS be a better, more stable product.

-MC

P.S. - I just saw Brian's emails on this thread, so be sure to check his
suggestions as well!

________________________________

From: freeswitch-users-bounces at lists.freeswitch.org
[mailto:freeswitch-users-bounces at lists.freeswitch.org] On Behalf Of
tuhl at ix.netcom.com
Sent: Thursday, November 29, 2007 12:54 PM
To: freeswitch-users at lists.freeswitch.org
Subject: [Freeswitch-users] Capacity testing, seg fault

Hi, 

I'm running some capacity tests on FreeSWITCH and can cause seg-faults
fairly quickly (<1 minute) at a 'light' load of 10 call originations per
second. The core dump backtrace is at the bottom, and my debugging shows
what looks like a corrupted js_session. I'll open an issue on JIRA. I
wanted to get opinions on whether this is a valid architecture for
testing capacity, and whether I'm making a simple mistake.

Environment:
I have the trunk version installed on 2 servers and have one server (the
originator) calling the other (the receiver) using SIP, G.711, with a
Gigabit Ethernet switch between them. The originating server basically
does a session.originate, waitForAnswer, streamFile (10-second 8 kHz
.wav), and hangup. The receiver does a session.answer and a recordFile
to a .wav file so I can go back and check voice quality.

My capacity testing engine is a Perl script which uses the XML-RPC
interface to originate the calls on FreeSWITCH (I submit requests to do
a 'jsrun play.js' to a certain phone number, where play.js is a simple
script which does the originate, waitForAnswer, streamFile, and hangup).
I can configure it to make a certain number of originations per second
and a certain number of total calls. I have no Perl script running on
the receiver - I just set up the dialplan to call a .js which answers the
call and records it. This testing setup is at an early stage, so right
now, to check pass/fail, I just verify that if I ran a 1000-call test on
the originator, there should be 1000 .wav files that are all about the
same size, on the receiver at the end of the test, and no crashes.
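
The pass/fail check described above (the expected number of .wav files,
all about the same size) could be automated along these lines. This is
only a minimal sketch and not part of the setup described in this
message: the function name, the flat directory of recordings, and the
10% size tolerance are all assumptions.

```python
import statistics
from pathlib import Path


def check_recordings(directory, expected_count, tolerance=0.10):
    """Return (passed, problems) for a directory of recorded .wav files.

    Hypothetical helper: 'tolerance' is the allowed fractional deviation
    of each file size from the median size (10% is an arbitrary choice).
    """
    # Map file name -> size in bytes for every .wav directly in 'directory'.
    sizes = {p.name: p.stat().st_size for p in Path(directory).glob("*.wav")}
    problems = []

    # Check 1: one recording per originated call.
    if len(sizes) != expected_count:
        problems.append(f"expected {expected_count} files, found {len(sizes)}")

    # Check 2: every recording is close to the median size.
    if sizes:
        median = statistics.median(sizes.values())
        for name, size in sorted(sizes.items()):
            if abs(size - median) > tolerance * median:
                problems.append(f"{name}: {size} bytes (median {median:.0f})")

    return (not problems, problems)
```

Calling check_recordings(path, 1000) after a 1000-call run would report
both missing files and truncated recordings. It does not detect a
crash by itself, so the "no crashes" condition still has to be checked
separately.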

I've compiled with debug flags on, and I've set all *_DEBUG flags to 9
(I have also run the tests after a recompile with debug flags off/0, and
that didn't make any difference). I've done all the ulimit commands that
were in the last few emails on this list. I'm running on FC6 on a Dell
2850 with dual 3.6 GHz Xeons (/proc/cpuinfo shows 4 processors), and 4 GB
RAM. I've set max-sessions to 3000 and Session Rate to 100. 'top' is
showing freeswitch at 60-80% on the receiver during this test.

ISSUE: 
My problem right now is on the receiver, which I wouldn't care about
because I'm most interested in the origination capacity of freeswitch,
but with my receiver crashing so quickly, I can't push the originator
very hard. I set up my originating engine to make 1000 total calls at 10
call originations per second, each call lasting 10 seconds (which
results in about 120 simultaneous channels in use), and I get a seg
fault on the receiver within about 500 calls or 50 seconds. If I run a
test with 1000 total calls at 6 call originations per second, it will
work, but if I run an overnight test with 20,000 total calls at 6 call
originations per second, the receiver will sometimes seg-fault at around
15,000 calls, and sometimes it will not. Interestingly though, if I do a
very short but very high rate test of 100 total calls at 50 call
originations per second, that will usually work. But 500 total calls at
50 calls per second will always seg-fault.
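
The load figures above follow from simple arithmetic: calls in progress
are roughly the origination rate times how long each call holds a
channel, and the time needed to place all calls is the total divided by
the rate. A rough sketch, assuming a steady origination rate; the
function names are illustrative, and the 12-second hold time is only an
inference from the ~120 channels mentioned above (10 seconds of audio
plus setup and teardown), not a measured value.

```python
def concurrent_channels(rate_per_sec, hold_time_sec):
    """Steady-state calls in progress (Little's law: L = rate * hold time)."""
    return rate_per_sec * hold_time_sec


def origination_window_sec(total_calls, rate_per_sec):
    """Seconds needed to place all calls at a steady origination rate."""
    return total_calls / rate_per_sec


if __name__ == "__main__":
    print(concurrent_channels(10, 10))      # 10 s of audio only
    print(concurrent_channels(10, 12))      # assumed 12 s hold time
    print(origination_window_sec(500, 10))  # calls placed before the crash
    print(origination_window_sec(100, 50))  # short, high-rate test
```

By this estimate, the 100-call test at 50 originations per second
finishes placing calls in about 2 seconds, so it can never have more
than 100 calls up at once, while 500 calls at the same rate keeps
originating for 10 seconds and so can pile up several hundred
simultaneous calls. That may be part of why the short test usually
survives, though this is a guess from the numbers and not something the
backtrace confirms.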

Just so you know... I'm shooting for the holy grail of stable operation
at 100 call originations per second. I know people have reported much
better results than I'm getting. Is something in my setup bad?

Here's the core dump backtrace. I added some debug printf's in
session_destroy right before the call to destroy_speech_engine, and it
looks like the jss has been trampled - for example, jss->flags is always
0 for all my successful calls, but right before it seg-faults,
jss->flags is some large random number. This happens every single time.

Program terminated with signal 11, Segmentation fault.
#0  0x00000000 in ?? ()
(gdb) bt
#0  0x00000000 in ?? ()
#1  0x40040437 in switch_core_codec_destroy (codec=0x54ece168) at src/switch_core_codec.c:245
#2  0x40ee778b in destroy_speech_engine (jss=0x51206538) at mod_spidermonkey.c:1652
#3  0x40eeaa70 in session_destroy (cx=0x549c9920, obj=0x4eeac7f0) at mod_spidermonkey.c:2723
#4  0x417d1aa7 in js_FinalizeObject (cx=0x549c9920, obj=0x4eeac7f0) at src/jsobj.c:2168
#5  0x417b04d9 in js_GC (cx=0x549c9920, gcflags=0) at src/jsgc.c:1856
#6  0x417af6ad in js_ForceGC (cx=0x549c9920, gcflags=0) at src/jsgc.c:1508
#7  0x417830fd in js_DestroyContext (cx=0x549c9920, gcmode=JS_FORCE_GC) at src/jscntxt.c:285
#8  0x417727ac in JS_DestroyContext (cx=0x549c9920) at src/jsapi.c:956
#9  0x40eec3c9 in js_parse_and_execute (session=0x464d9678, input_code=0x9e31458 "capacity.js", ro=0x0) at mod_spidermonkey.c:3296
#10 0x40eec3f2 in js_dp_function (session=0x464d9678, data=0x9e31458 "capacity.js") at mod_spidermonkey.c:3302
#11 0x40044341 in switch_core_session_exec (session=0x464d9678, application_interface=0x40f26f80, arg=0x9e31458 "capacity.js") at src/switch_core_session.c:936
#12 0x400455be in switch_core_standard_on_execute (session=0x464d9678) at src/switch_core_state_machine.c:169
#13 0x40046605 in switch_core_session_run (session=0x464d9678) at src/switch_core_state_machine.c:406
#14 0x4004381c in switch_core_session_thread (thread=0x9e31288, obj=0x464d9678) at src/switch_core_session.c:681
#15 0x4009047c in dummy_worker (opaque=0x9e31288) at threadproc/unix/thread.c:138
#16 0x007cb3db in start_thread () from /lib/libpthread.so.0
#17 0x0072506e in clone () from /lib/libc.so.6

Tom

===============
tuhl at ix.netcom.com
