[Freeswitch-users] Testing Freeswitch performance led to strange behavior

Thu Jun 4 07:32:22 PDT 2009

While evaluating Freeswitch for use in a production environment
I ran a number of performance tests on various servers, and I
noticed some strange behavior from FS. It still shows up after
stripping my scenario down to a simple bridge.

The scenario, as I said, is quite simple:

|---------|                              |---------|
|         |------- Call from sipp------> |         |
|  sipp   |                              |   FS    |
|         | <------ Call back to sipp----|         |
|---------|                              |---------|

I did not use an RTP stream for my calls, so that only
the signaling is tested.

The sipp scenario is the standard uac.xml scenario that
is built into sipp, run with the following options :

Test FS 1:

sipp <FS_IP>:5060 -s 55555555 -i <SIPP_IP> -mi <SIPP_IP> -ci <SIPP_IP> 
-r 10 -d 5000 -l 100 -m 1000 -sf uac.xml

Calls                 : 1000
Successful calls      : 1000
Idle CPU during tests : ~35-60 % (35 while new calls were being
generated, 60 while the -l limit imposed by the test was in effect)

Note : 985 of them had a duration (billsec) of 10 and 15 of them had a 
duration of 11.

I tried raising the call rate and limit...

Test FS 2:

sipp <FS_IP>:5060 -s 55555555 -i <SIPP_IP> -mi <SIPP_IP> -ci <SIPP_IP> 
-r 20 -d 5000 -l 200 -m 1000 -sf uac.xml

Calls                 : 1000
Successful calls      : 1000
Idle CPU during tests : ~0-30 % (0 while new calls were being
generated, 30 while the -l limit imposed by the test was in effect)
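
A quick sanity check on the -r and -l values used in the two runs above: in steady state the number of simultaneous calls is roughly the call rate multiplied by the time each call is held (Little's law). The Python sketch below assumes a hold time of 10 secs, taken from the billed durations of the first test; the function name is only illustrative.

```python
def steady_state_calls(rate_cps, hold_secs):
    """Little's law: average simultaneous calls = call rate * hold time."""
    return rate_cps * hold_secs


HOLD_SECS = 10  # assumption: hold time taken from the billed durations

# (-r, -l) pairs used in the two sipp runs above
for rate, limit in [(10, 100), (20, 200)]:
    calls = steady_state_calls(rate, HOLD_SECS)
    print(f"-r {rate}: about {calls} simultaneous calls, -l {limit}")
```

Under that assumption both runs sit right at their -l limit once they reach steady state.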

THIS IS WHAT MAKES ME WONDER :

The distribution of the durations (billsec - not complete durations) :

    183  calls with 10 secs billed duration
    110  calls with 11 secs billed duration
    238  calls with 12 secs billed duration
    447  calls with 13 secs billed duration
     22  calls with 14 secs billed duration
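
To put a number on this distribution, here is a minimal Python sketch that totals the billed seconds and compares them with the 10 secs per call that the scenario should produce. The counts are copied from the list above; the names are only illustrative, and the 10-second baseline is an assumption based on the first test.

```python
# Billed-duration histogram from the second test: {billsec: number of calls}.
billsec_counts = {10: 183, 11: 110, 12: 238, 13: 447, 14: 22}

EXPECTED_BILLSEC = 10  # assumption: every call should bill 10 secs, as in test 1


def summarize(counts, expected):
    """Return total calls, billed seconds, excess seconds and late-call count."""
    total_calls = sum(counts.values())
    billed = sum(secs * n for secs, n in counts.items())
    excess = billed - expected * total_calls
    late_calls = sum(n for secs, n in counts.items() if secs > expected)
    return total_calls, billed, excess, late_calls


total_calls, billed, excess, late_calls = summarize(billsec_counts, EXPECTED_BILLSEC)
percent_over = 100 * excess / (EXPECTED_BILLSEC * total_calls)
print(f"calls={total_calls} billed={billed}s excess={excess}s "
      f"({percent_over:.2f}% over) late={late_calls}")
```

With these counts it reports 12015 billed seconds against an expected 10000, i.e. about 20% over, with 817 of the 1000 calls billed longer than 10 secs.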

The sipp scenario is simply "hang up the phone after 10 secs". So why am
I seeing these durations? Of course it has something to do with the
stress the machine was put under during the second test. But I can see
it happening, to a smaller extent, under less stressful conditions as
well (e.g. 15 calls per second).

I captured one of these calls and verified that when the sipp client
hangs up exactly 10 secs after the call start, FS receives the BYE
and replies with 200 OK. BUT it does not hang up the second leg in a
timely manner, i.e. it sends a BYE to the sipp server side 1-4 seconds
AFTER that. That explains the 11, 12, 13 and 14 sec durations seen in
the second test. What is more interesting is that I would expect the
CDRs to show different durations for the first and second leg (since
the A leg BYE was received and acknowledged by FS at the correct time),
i.e. 10 and 14 secs respectively. But what I get is the same duration
for both legs (14 secs).
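
The delay can be measured from a capture by pairing the BYE that FS receives on the A leg with the BYE it sends on the B leg. The Python sketch below assumes the relevant timestamps have already been read off the capture by hand; the LegTimes structure, its field names and the sample values are hypothetical and are not taken from the actual traces.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class LegTimes:
    """Timestamps (in seconds) for one bridged call, read off a packet capture."""
    answered: float       # call answered (start of the billed period)
    a_leg_bye_in: float   # BYE received by FS from the sipp client
    b_leg_bye_out: float  # BYE sent by FS to the sipp server side


def hangup_lag(call: LegTimes) -> float:
    """Seconds FS took to tear down the B leg after the A leg hung up."""
    return call.b_leg_bye_out - call.a_leg_bye_in


def expected_durations(call: LegTimes) -> Tuple[float, float]:
    """Durations each leg would show if it were billed up to its own BYE."""
    return (call.a_leg_bye_in - call.answered,
            call.b_leg_bye_out - call.answered)


# Hypothetical sample: A leg hangs up at 10 s, FS sends the B leg BYE at 14 s.
sample = LegTimes(answered=0.0, a_leg_bye_in=10.0, b_leg_bye_out=14.0)
print(hangup_lag(sample))          # lag between the two BYEs
print(expected_durations(sample))  # per-leg durations one would expect in the CDRs
```

Comparing the per-leg values from such a calculation with the billsec fields of the two CDRs shows directly whether the A leg is being billed up to the late B leg BYE.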

This, in my opinion, is very dangerous in production environments:
either your provider charges you for more seconds than you charge your
clients, or you wrongly charge your clients for longer durations
although they hung up correctly (and you acknowledged it).

NOTE No 1 : All the performance recommendations found in the wiki have
been applied. In fact, only the essential modules needed to make this
scenario work were loaded.

NOTE No 2 : I tried using asterisk as a point of reference (don't get
me wrong, I am not trying to start a flame war here). On the same
hardware it managed 60 calls/sec with a channel limit of 400
simultaneous calls, using at most 50% of the cpu. Under no
circumstances have I seen the behavior above (this inability to hang up
call legs in a timely manner). Even when I pushed asterisk to its limits
(80 calls per second, 600 max call limit) and it started failing on some
calls, it never failed to hang up both legs at exactly 10 secs.

NOTE No 3 : As you can tell, I was using a very small machine for my
tests. When I moved the same tests to larger installations (Quad Core
Opterons and Xeons) I got results proportional to the above.

NOTE No 4 : The tests were performed in a LAN environment and since 
there was no RTP involved I think there were no bandwidth issues there.

NOTE No 5 : The tests were performed using numerous SVN versions (latest 
: 13610), the stable version and the 1.0.4pre8 version.

NOTE No 6 : Using the -hp switch made no noticeable change in behavior.

I am not trying to complain about FS's performance (far from it). I am
just somewhat disappointed to see it behave in such a strange manner
when under stress. I would prefer a design that drops calls above a
certain threshold to a design that handles all of them incorrectly (I am
aware of the max sessions per second setting in switch.conf.xml, but I
am starting to see this behavior even with the cpu idling at about 80%).
I don't know if anyone else has had the same experience when testing
Freeswitch. I can happily supply all the test details (config
files, captures etc.) to any interested parties.

-- 
-------------------------------------------
Apostolos Pantsiopoulos
Kinetix Tele.com R & D
email: regs at kinetix.gr
-------------------------------------------