[Freeswitch-users] High cps load causes weird cpu and memory starvation. Need suggestions on how to debug.

Thu Mar 7 14:09:46 MSK 2013

Hi!

Please assist me to find a way to get some insight into this problem and
file a proper bug report. Looks like 2 different problems, one is load
related (like a race of some sort) the other is a mem leak.

I'm running FS debs compiled on ubuntu 12.04 (-n20130227T114536Z-1~presice+
1git).
The machine is ubuntu 12.04 server , CPU: Intel(R) Xeon(R) CPU E5620  @
2.40GHz, 4GB RAM and no swap. 3.1G RAM free after FS is started.
I disabled HT in BIOS to get proper CPU load readings from top and vmstat
(no HT also allows for higher cps for some reason...).

The FS setup has 2 SIP profiles , internal/external (functioning as
incoming/outgoing ), XML Dialplan that runs a lua script.
Calls come in to profile 'internal',  authed by ip acl , go to the dialplan
and sent through profile 'external' to a uri string generated by the lua
script.

I bombard the machine with high call rate of short calls ,120-180cps , 5sec.
Everything stays fine for some time , the higher the load the faster the
problems appear.
When i pressure the machine with 150cps and up , it works fine at first
with 20%-30% idle CPU.
After some time passes (10-30minutes) system cpu usage jumps from 3%-7% to
around 50%, CPU is 95% busy and vmstat reports hundereds and sometimes
thousands of runnable tasks.
If i stop sending calls at this moment, wait for the CPU load to wind down
and start calling again then the CPU load saturates right away. Even if i
lower the CPS rate.
restarting freeswitch at this point brings the system to the original state
and it will take some time to get saturated again.

Another symptom is that during the test the available free mem is slowly
allocated to freeswitch until it reaches 13.1Gb virtual image and then the
kernel kills it.

Here are some numbers:

Running 120 calls per second , 5second calls:

Starts like this: This is very good from my POV.

procs -----------memory---------- ---swap-- -----io---- -system--
----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id
wa
 5  0      0 3255876  35444  96580    0    0     9     2   66   90  1  1
98  0
 4  0      0 3253080  35444  96652    0    0     0     0 16048 20727 36  3
61  0
 4  0      0 3247780  35444  96684    0    0     0     0 16521 21048 36  4
60  0
 4  0      0 3242860  35444  96784    0    0     0     0 15695 19825 36  3
62  0
 1  0      0 3237980  35452  96768    0    0     0    24 16072 19851 36  3
60  0
 4  0      0 3233692  35452  96812    0    0     0     0 16299 20980 35  4
61  0
 8  0      0 3230280  35452  96928    0    0     0     0 15034 18965 36  3
61  0
 2  0      0 3227764  35452  96896    0    0     0     0 13956 20621 37  3
60  0

After some (long) time it started having trouble and eat lots of free ram
(note that the idle cpu% and system cpu%  jumped up but there are still
over 1G free mem available) :

procs -----------memory---------- ---swap-- -----io---- -system--
----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id
wa
22  0      0 1108588 100304 2034628    0    0     0     0 145350 183629 37
53 10  0
 5  0      0 1108224 100304 2034668    0    0     0     0 151937 167249 38
57  5  0
11  0      0 1108268 100304 2034904    0    0     0     0 149931 176143 37
58  5  0
 7  0      0 1108244 100312 2034752    0    0     0    32 138612 183583 40
47 14  0
 3  0      0 1108056 100312 2034792    0    0     0     0 111446 162733 42
29 29  0
 4  0      0 1108064 100312 2034836    0    0     0     0 112568 164187 41
30 29  0
14  0      0 1108040 100312 2034880    0    0     0     0 133372 174676 39
45 17  0
10  0      0 1107468 100312 2034920    0    0     0     0 157371 184888 37
58  5  0

Short time after that is got into really serious trouble (3000 runnable
tasks!? heading toward memory exhaustion):

procs -----------memory---------- ---swap-- -----io---- -system--
----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id
wa

3336  0      0 146580   9248 1457016    0    0     0     0 306642 291819 28
67  5  0
1180  0      0 145060   9220 1430212    0    0   248   848 294654 285273 27
68  5  0
1028  0      0 148616   8996 1366844    0    0     0     0 421614 385342 32
63  4  0
1026  0      0 150964   8880 1298832    0    0     0    28 462825 416480 31
65  4  0
1338  0      0 146260   8688 1242440    0    0     0     0 442247 391139 27
69  5  0
1945  0      0 143148   8680 1197180    0    0     0     0 364361 329958 24
71  5  0
1381  0      0 154668   8596 1137912    0    0     0     0 359982 323543 23
72  5  0

After UAC completed transactions 5198242/ UAS completed transactions
5512619 calls FS crashed and it looked like this:

procs -----------memory---------- ---swap-- -----io---- -system--
----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id
wa

3  2      0  92972    568  20492    0    0   156     8 117314 130220  7 40
38 15   <-- back in the day i would kill for 92Mb free mem.
10  5      0 3473328    208   9848    0    0 24344    48 104557 89011  5 62
15 17   <- Now the kernel kills FS for taking the rest of it.
 8  0      0 3482436    340  14408    0    0  3208     0 2582 2046  0 95
4  1
 9  0      0 3504792    348  14344    0    0     0    48 2312 1836  0 93
7  0
12  0      0 3532184    348  14348    0    0     0     0 2387 1984  0 94
6  0
 6  0      0 3561792    348  14876    0    0   528     0 2523 2178  0 93
6  0
11  0      0 3592816    348  14876    0    0     0     0 2684 2248  0 95
5  0
 3  0      0 3615424    348  14876    0    0     0     0 2820 2494  0 95
5  0
 7  0      0 3633488    348  14876    0    0     0     0 2990 2805  0 94
6  0
 7  0      0 3654916    348  14876    0    0     0     0 3350 3222  0 96
4  0
11  0      0 3680708    348  14876    0    0     0     0 3869 4018  0 94
6  0
 0  0      0 3773532    476  16876    0    0  2344     0 6598 9501  0 74
24  2
 0  0      0 3773836    476  16912    0    0     0     0  191  129  0  0
100  0   <- FS is not running any more, 100% idle CPU , lots of free ram
 0  0      0 3773856    476  17108    0    0     0     0   91   63  0  0
100  0

dmesg of the crash is this:

[48674.576126] Out of memory: Kill process 31552 (freeswitch) score 674 or
sacrifice child
[48674.578188] Killed process 31552 (freeswitch) total-vm:10177480kB,
anon-rss:2710092kB, file-rss:0kB

fs_log switches to the dark side : (there was no disk space problem of
course, i have no idea at which time point in the spiral down this popped
on the console )

2013-03-06 23:27:22.923640 [ERR] switch_core_sqldb.c:579 NATIVE SQL ERR
[database or disk is full]
BEGIN EXCLUSIVE
2013-03-06 23:27:22.923640 [CRIT] switch_core_sqldb.c:1679 ERROR [database
or disk is full]
2013-03-06 23:27:22.923640 [ERR] switch_core_sqldb.c:579 NATIVE SQL ERR
[cannot commit - no transaction is active]
COMMIT
2013-03-06 23:27:22.963609 [CRIT] switch_core_session.c:1674 Thread Failure!
2013-03-06 23:27:22.963609 [CRIT] switch_core_session.c:1634 LUKE: I'm hit,
but not bad.
2013-03-06 23:27:22.963609 [CRIT] switch_core_session.c:1635 LUKE'S VOICE:
Artoo, see what you can do with it. Hang on back there....
Green laserfire moves past the beeping little robot as his head turns.
After a few beeps and a twist of his mechanical arm,
Artoo reduces the max sessions to 13897 thus, saving the switch from
certain doom.

and thats all , FS is dead.

Runing the test at 180 cps will significantly hasten the death.
It starts normal with ~40% idle cpu
then after circa 160k calls it starts to spiral:
procs -----------memory---------- ---swap-- -----io---- -system--
----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id
wa
 6  0      0 2901484  63204 161132    0    0     0     0 20586 33073 54  5
41  0
 5  0      0 2901468  63212 160988    0    0     0    12 25359 41915 53  6
42  0
 4  0      0 2901220  63212 161064    0    0     0     0 19650 31211 55  5
40  0
 7  0      0 2901080  63212 161124    0    0     0     0 21148 33462 54  5
41  0
 2  0      0 2900676  63212 161188    0    0     0     0 21767 35568 54  5
42  0
 5  0      0 2900468  63212 161256    0    0     0     0 19398 30073 54  5
41  0
 2  0      0 2900088  63220 161304    0    0     0    12 21315 34563 54  5
42  0
 3  0      0 2900180  63220 161484    0    0     0     0 19858 31561 42  4
54  0       <-- this should not happen the load is constant and idle cpu%
should stay at ~40%
 3  0      0 2899768  63220 161520    0    0     0     0 29907 46883 39  4
57  0
11  0      0 2900056  63220 161504    0    0     0     0 41730 63440 41  5
54  0
 9  0      0 2899536  63220 161768    0    0     0     0 48876 68159 40  5
54  0
 6  0      0 2899512  63228 161620    0    0     0    12 59988 80421 41  6
53  0
49  0      0 2876072  63228 162004    0    0     0     0 193384 241453 57
23 21  0      <-- here we go,  starting the memory hogging cpu eating spiral
40  0      0 2830720  63228 162080    0    0     0     0 308948 356178 51
49  0  0
156  0      0 2776576  63228 161984    0    0     0     0 301514 325415 48
48  5  0
178  0      0 2727044  63228 162260    0    0     0     0 290168 306158 44
51  5  0
70  0      0 2683252  63236 161924    0    0     0    48 291792 302632 37
58  5  0
 5  0      0 2618432  63236 162204    0    0     0     0 316170 302886 35
59  7  0
 2  0      0 2548092  63236 162364    0    0     0     0 340707 319105 38
56  5  0
 2  0      0 2460528  63236 162464    0    0     0  1596 386953 348210 40
54  6  0

Nothing on fs warn/err log at this point until memory is all full.

Please give me some ideas on how to get some useful info for a bug report.

Kind Regards,
Boris Ratner.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.freeswitch.org/pipermail/freeswitch-users/attachments/20130307/6697a670/attachment-0001.html