[Freeswitch-users] High cps load causes weird cpu and memory starvation. Need suggestions on how to debug.
bratner bratner
ratner2 at gmail.com
Thu Mar 7 14:09:46 MSK 2013
Hi!
Please assist me to find a way to get some insight into this problem and
file a proper bug report. Looks like 2 different problems, one is load
related (like a race of some sort) the other is a mem leak.
I'm running FS debs compiled on ubuntu 12.04 (-n20130227T114536Z-1~presice+
1git).
The machine is ubuntu 12.04 server , CPU: Intel(R) Xeon(R) CPU E5620 @
2.40GHz, 4GB RAM and no swap. 3.1G RAM free after FS is started.
I disabled HT in BIOS to get proper CPU load readings from top and vmstat
(no HT also allows for higher cps for some reason...).
The FS setup has 2 SIP profiles , internal/external (functioning as
incoming/outgoing ), XML Dialplan that runs a lua script.
Calls come in to profile 'internal', authed by ip acl , go to the dialplan
and sent through profile 'external' to a uri string generated by the lua
script.
I bombard the machine with high call rate of short calls ,120-180cps , 5sec.
Everything stays fine for some time , the higher the load the faster the
problems appear.
When i pressure the machine with 150cps and up , it works fine at first
with 20%-30% idle CPU.
After some time passes (10-30minutes) system cpu usage jumps from 3%-7% to
around 50%, CPU is 95% busy and vmstat reports hundereds and sometimes
thousands of runnable tasks.
If i stop sending calls at this moment, wait for the CPU load to wind down
and start calling again then the CPU load saturates right away. Even if i
lower the CPS rate.
restarting freeswitch at this point brings the system to the original state
and it will take some time to get saturated again.
Another symptom is that during the test the available free mem is slowly
allocated to freeswitch until it reaches 13.1Gb virtual image and then the
kernel kills it.
Here are some numbers:
Running 120 calls per second , 5second calls:
Starts like this: This is very good from my POV.
procs -----------memory---------- ---swap-- -----io---- -system--
----cpu----
r b swpd free buff cache si so bi bo in cs us sy id
wa
5 0 0 3255876 35444 96580 0 0 9 2 66 90 1 1
98 0
4 0 0 3253080 35444 96652 0 0 0 0 16048 20727 36 3
61 0
4 0 0 3247780 35444 96684 0 0 0 0 16521 21048 36 4
60 0
4 0 0 3242860 35444 96784 0 0 0 0 15695 19825 36 3
62 0
1 0 0 3237980 35452 96768 0 0 0 24 16072 19851 36 3
60 0
4 0 0 3233692 35452 96812 0 0 0 0 16299 20980 35 4
61 0
8 0 0 3230280 35452 96928 0 0 0 0 15034 18965 36 3
61 0
2 0 0 3227764 35452 96896 0 0 0 0 13956 20621 37 3
60 0
After some (long) time it started having trouble and eat lots of free ram
(note that the idle cpu% and system cpu% jumped up but there are still
over 1G free mem available) :
procs -----------memory---------- ---swap-- -----io---- -system--
----cpu----
r b swpd free buff cache si so bi bo in cs us sy id
wa
22 0 0 1108588 100304 2034628 0 0 0 0 145350 183629 37
53 10 0
5 0 0 1108224 100304 2034668 0 0 0 0 151937 167249 38
57 5 0
11 0 0 1108268 100304 2034904 0 0 0 0 149931 176143 37
58 5 0
7 0 0 1108244 100312 2034752 0 0 0 32 138612 183583 40
47 14 0
3 0 0 1108056 100312 2034792 0 0 0 0 111446 162733 42
29 29 0
4 0 0 1108064 100312 2034836 0 0 0 0 112568 164187 41
30 29 0
14 0 0 1108040 100312 2034880 0 0 0 0 133372 174676 39
45 17 0
10 0 0 1107468 100312 2034920 0 0 0 0 157371 184888 37
58 5 0
Short time after that is got into really serious trouble (3000 runnable
tasks!? heading toward memory exhaustion):
procs -----------memory---------- ---swap-- -----io---- -system--
----cpu----
r b swpd free buff cache si so bi bo in cs us sy id
wa
3336 0 0 146580 9248 1457016 0 0 0 0 306642 291819 28
67 5 0
1180 0 0 145060 9220 1430212 0 0 248 848 294654 285273 27
68 5 0
1028 0 0 148616 8996 1366844 0 0 0 0 421614 385342 32
63 4 0
1026 0 0 150964 8880 1298832 0 0 0 28 462825 416480 31
65 4 0
1338 0 0 146260 8688 1242440 0 0 0 0 442247 391139 27
69 5 0
1945 0 0 143148 8680 1197180 0 0 0 0 364361 329958 24
71 5 0
1381 0 0 154668 8596 1137912 0 0 0 0 359982 323543 23
72 5 0
After UAC completed transactions 5198242/ UAS completed transactions
5512619 calls FS crashed and it looked like this:
procs -----------memory---------- ---swap-- -----io---- -system--
----cpu----
r b swpd free buff cache si so bi bo in cs us sy id
wa
3 2 0 92972 568 20492 0 0 156 8 117314 130220 7 40
38 15 <-- back in the day i would kill for 92Mb free mem.
10 5 0 3473328 208 9848 0 0 24344 48 104557 89011 5 62
15 17 <- Now the kernel kills FS for taking the rest of it.
8 0 0 3482436 340 14408 0 0 3208 0 2582 2046 0 95
4 1
9 0 0 3504792 348 14344 0 0 0 48 2312 1836 0 93
7 0
12 0 0 3532184 348 14348 0 0 0 0 2387 1984 0 94
6 0
6 0 0 3561792 348 14876 0 0 528 0 2523 2178 0 93
6 0
11 0 0 3592816 348 14876 0 0 0 0 2684 2248 0 95
5 0
3 0 0 3615424 348 14876 0 0 0 0 2820 2494 0 95
5 0
7 0 0 3633488 348 14876 0 0 0 0 2990 2805 0 94
6 0
7 0 0 3654916 348 14876 0 0 0 0 3350 3222 0 96
4 0
11 0 0 3680708 348 14876 0 0 0 0 3869 4018 0 94
6 0
0 0 0 3773532 476 16876 0 0 2344 0 6598 9501 0 74
24 2
0 0 0 3773836 476 16912 0 0 0 0 191 129 0 0
100 0 <- FS is not running any more, 100% idle CPU , lots of free ram
0 0 0 3773856 476 17108 0 0 0 0 91 63 0 0
100 0
dmesg of the crash is this:
[48674.576126] Out of memory: Kill process 31552 (freeswitch) score 674 or
sacrifice child
[48674.578188] Killed process 31552 (freeswitch) total-vm:10177480kB,
anon-rss:2710092kB, file-rss:0kB
fs_log switches to the dark side : (there was no disk space problem of
course, i have no idea at which time point in the spiral down this popped
on the console )
2013-03-06 23:27:22.923640 [ERR] switch_core_sqldb.c:579 NATIVE SQL ERR
[database or disk is full]
BEGIN EXCLUSIVE
2013-03-06 23:27:22.923640 [CRIT] switch_core_sqldb.c:1679 ERROR [database
or disk is full]
2013-03-06 23:27:22.923640 [ERR] switch_core_sqldb.c:579 NATIVE SQL ERR
[cannot commit - no transaction is active]
COMMIT
2013-03-06 23:27:22.963609 [CRIT] switch_core_session.c:1674 Thread Failure!
2013-03-06 23:27:22.963609 [CRIT] switch_core_session.c:1634 LUKE: I'm hit,
but not bad.
2013-03-06 23:27:22.963609 [CRIT] switch_core_session.c:1635 LUKE'S VOICE:
Artoo, see what you can do with it. Hang on back there....
Green laserfire moves past the beeping little robot as his head turns.
After a few beeps and a twist of his mechanical arm,
Artoo reduces the max sessions to 13897 thus, saving the switch from
certain doom.
and thats all , FS is dead.
Runing the test at 180 cps will significantly hasten the death.
It starts normal with ~40% idle cpu
then after circa 160k calls it starts to spiral:
procs -----------memory---------- ---swap-- -----io---- -system--
----cpu----
r b swpd free buff cache si so bi bo in cs us sy id
wa
6 0 0 2901484 63204 161132 0 0 0 0 20586 33073 54 5
41 0
5 0 0 2901468 63212 160988 0 0 0 12 25359 41915 53 6
42 0
4 0 0 2901220 63212 161064 0 0 0 0 19650 31211 55 5
40 0
7 0 0 2901080 63212 161124 0 0 0 0 21148 33462 54 5
41 0
2 0 0 2900676 63212 161188 0 0 0 0 21767 35568 54 5
42 0
5 0 0 2900468 63212 161256 0 0 0 0 19398 30073 54 5
41 0
2 0 0 2900088 63220 161304 0 0 0 12 21315 34563 54 5
42 0
3 0 0 2900180 63220 161484 0 0 0 0 19858 31561 42 4
54 0 <-- this should not happen the load is constant and idle cpu%
should stay at ~40%
3 0 0 2899768 63220 161520 0 0 0 0 29907 46883 39 4
57 0
11 0 0 2900056 63220 161504 0 0 0 0 41730 63440 41 5
54 0
9 0 0 2899536 63220 161768 0 0 0 0 48876 68159 40 5
54 0
6 0 0 2899512 63228 161620 0 0 0 12 59988 80421 41 6
53 0
49 0 0 2876072 63228 162004 0 0 0 0 193384 241453 57
23 21 0 <-- here we go, starting the memory hogging cpu eating spiral
40 0 0 2830720 63228 162080 0 0 0 0 308948 356178 51
49 0 0
156 0 0 2776576 63228 161984 0 0 0 0 301514 325415 48
48 5 0
178 0 0 2727044 63228 162260 0 0 0 0 290168 306158 44
51 5 0
70 0 0 2683252 63236 161924 0 0 0 48 291792 302632 37
58 5 0
5 0 0 2618432 63236 162204 0 0 0 0 316170 302886 35
59 7 0
2 0 0 2548092 63236 162364 0 0 0 0 340707 319105 38
56 5 0
2 0 0 2460528 63236 162464 0 0 0 1596 386953 348210 40
54 6 0
Nothing on fs warn/err log at this point until memory is all full.
Please give me some ideas on how to get some useful info for a bug report.
Kind Regards,
Boris Ratner.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.freeswitch.org/pipermail/freeswitch-users/attachments/20130307/6697a670/attachment-0001.html
Join us at ClueCon 2011 Aug 9-11, 2011
More information about the FreeSWITCH-users
mailing list