[Freeswitch-users] High cps load causes weird cpu and memory starvation. Need suggestions on how to debug.

Ken Rice krice at freeswitch.org
Fri Mar 8 04:45:38 MSK 2013


You are probably hammering the disk subsystem... Keep in mind that FS uses
multiple sqlite databases by default... Mount the fs db dir as tmpfs and try
again
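
Something along these lines should do it (the path is only an example; where the
db dir lives depends on how you built/packaged FS, /usr/local/freeswitch/db for
a default source build):

    mount -t tmpfs -o size=512m tmpfs /usr/local/freeswitch/db

or the equivalent /etc/fstab entry, then restart FS so it recreates its sqlite
databases on the tmpfs.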


On 3/7/13 7:35 PM, "Dmitry Lysenko" <dvl36.ripe.nick at gmail.com> wrote:

> Hm... But what about the huge interrupt and context-switch numbers?
> 
> 
> 2013/3/7 Steven Ayre <steveayre at gmail.com>
>> Upgrade to the latest version.
>> 
>> There are some memory leaks that have been found and fixed in 1.2.7.
>> 
>> -Steve
>> 
>> 
>> 
>> 
>> On 7 March 2013 11:09, bratner bratner <ratner2 at gmail.com> wrote:
>>> Hi!
>>> 
>>> Please help me find a way to get some insight into this problem so I can
>>> file a proper bug report. It looks like two different problems: one is
>>> load-related (a race of some sort), the other is a memory leak.
>>> 
>>> 
>>> I'm running FS debs compiled on Ubuntu 12.04 (-n20130227T114536Z-1~presice+1git).
>>> The machine is an Ubuntu 12.04 server, CPU: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz,
>>> 4GB RAM and no swap. 3.1G of RAM is free after FS is started.
>>> I disabled HT in the BIOS to get proper CPU load readings from top and vmstat
>>> (no HT also allows for a higher cps for some reason...).
>>> 
>>> The FS setup has 2 SIP profiles, internal/external (functioning as
>>> incoming/outgoing), and an XML dialplan that runs a Lua script.
>>> Calls come in on profile 'internal', are authed by IP ACL, go through the
>>> dialplan and are sent out through profile 'external' to a URI string
>>> generated by the Lua script.
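>>> 
>>> A stripped-down sketch of that part of the config (names changed, the real
>>> Lua script is more involved) looks roughly like this:
>>> 
>>>   <extension name="inbound">
>>>     <condition field="destination_number" expression="^(.+)$">
>>>       <action application="lua" data="route.lua"/>
>>>     </condition>
>>>   </extension>
>>> 
>>> and route.lua ends up doing something equivalent to:
>>> 
>>>   session:execute("bridge", "sofia/external/" .. dest_uri)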
>>> 
>>> I bombard the machine with a high rate of short calls, 120-180 cps, 5 seconds each.
>>> Everything stays fine for a while; the higher the load, the faster the
>>> problems appear.
>>> When I push the machine to 150 cps and up, it works fine at first with
>>> 20%-30% idle CPU.
>>> After some time passes (10-30 minutes), system CPU usage jumps from 3%-7% to
>>> around 50%, the CPU is 95% busy, and vmstat reports hundreds and sometimes
>>> thousands of runnable tasks.
>>> If I stop sending calls at that moment, wait for the CPU load to wind down
>>> and start calling again, the CPU load saturates right away, even if I lower
>>> the cps rate.
>>> Restarting freeswitch at this point brings the system back to its original
>>> state, and it will take some time to get saturated again.
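>>> 
>>> (The exact load generator shouldn't matter; a plain SIPp UAC along these
>>> lines, with the target address and call limit as placeholders, produces the
>>> same kind of traffic:
>>> 
>>>   sipp -sn uac <fs-internal-ip>:5060 -r 150 -d 5000 -l 2000
>>> 
>>> i.e. 150 new calls per second, each held for roughly 5 seconds.)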
>>> 
>>> Another symptom: during the test the available free memory is slowly
>>> allocated to freeswitch until it reaches a 13.1Gb virtual image, and then
>>> the kernel kills it.
>>> 
>>> Here are some numbers:
>>> 
>>> Running 120 calls per second, 5-second calls:
>>> 
>>> It starts like this (this is very good from my POV):
>>> 
>>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>>>  5  0      0 3255876  35444  96580    0    0     9     2   66   90  1  1 98  0
>>>  4  0      0 3253080  35444  96652    0    0     0     0 16048 20727 36  3 61  0
>>>  4  0      0 3247780  35444  96684    0    0     0     0 16521 21048 36  4 60  0
>>>  4  0      0 3242860  35444  96784    0    0     0     0 15695 19825 36  3 62  0
>>>  1  0      0 3237980  35452  96768    0    0     0    24 16072 19851 36  3 60  0
>>>  4  0      0 3233692  35452  96812    0    0     0     0 16299 20980 35  4 61  0
>>>  8  0      0 3230280  35452  96928    0    0     0     0 15034 18965 36  3 61  0
>>>  2  0      0 3227764  35452  96896    0    0     0     0 13956 20621 37  3 60  0
>>> 
>>> After some (long) time it started having trouble and eating lots of free RAM
>>> (note that system CPU% jumped up and idle CPU% dropped, but there is still
>>> over 1G of free memory available):
>>> 
>>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>>> 22  0      0 1108588 100304 2034628    0    0     0     0 145350 183629 37 53 10  0
>>>  5  0      0 1108224 100304 2034668    0    0     0     0 151937 167249 38 57  5  0
>>> 11  0      0 1108268 100304 2034904    0    0     0     0 149931 176143 37 58  5  0
>>>  7  0      0 1108244 100312 2034752    0    0     0    32 138612 183583 40 47 14  0
>>>  3  0      0 1108056 100312 2034792    0    0     0     0 111446 162733 42 29 29  0
>>>  4  0      0 1108064 100312 2034836    0    0     0     0 112568 164187 41 30 29  0
>>> 14  0      0 1108040 100312 2034880    0    0     0     0 133372 174676 39 45 17  0
>>> 10  0      0 1107468 100312 2034920    0    0     0     0 157371 184888 37 58  5  0
>>> 
>>> A short time after that it got into really serious trouble (3000 runnable
>>> tasks!? heading toward memory exhaustion):
>>> 
>>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>>> 3336  0      0 146580   9248 1457016    0    0     0     0 306642 291819 28 67  5  0
>>> 1180  0      0 145060   9220 1430212    0    0   248   848 294654 285273 27 68  5  0
>>> 1028  0      0 148616   8996 1366844    0    0     0     0 421614 385342 32 63  4  0
>>> 1026  0      0 150964   8880 1298832    0    0     0    28 462825 416480 31 65  4  0
>>> 1338  0      0 146260   8688 1242440    0    0     0     0 442247 391139 27 69  5  0
>>> 1945  0      0 143148   8680 1197180    0    0     0     0 364361 329958 24 71  5  0
>>> 1381  0      0 154668   8596 1137912    0    0     0     0 359982 323543 23 72  5  0
>>> 
>>> 
>>> After UAC completed transactions: 5198242 / UAS completed transactions: 5512619,
>>> FS crashed and it looked like this:
>>> 
>>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>>>  3  2      0  92972    568  20492    0    0   156     8 117314 130220  7 40 38 15   <-- back in the day I would have killed for 92Mb of free mem.
>>> 10  5      0 3473328    208   9848    0    0 24344    48 104557 89011  5 62 15 17   <-- now the kernel kills FS for taking the rest of it.
>>>  8  0      0 3482436    340  14408    0    0  3208     0 2582 2046  0 95  4  1
>>>  9  0      0 3504792    348  14344    0    0     0    48 2312 1836  0 93  7  0
>>> 12  0      0 3532184    348  14348    0    0     0     0 2387 1984  0 94  6  0
>>>  6  0      0 3561792    348  14876    0    0   528     0 2523 2178  0 93  6  0
>>> 11  0      0 3592816    348  14876    0    0     0     0 2684 2248  0 95  5  0
>>>  3  0      0 3615424    348  14876    0    0     0     0 2820 2494  0 95  5  0
>>>  7  0      0 3633488    348  14876    0    0     0     0 2990 2805  0 94  6  0
>>>  7  0      0 3654916    348  14876    0    0     0     0 3350 3222  0 96  4  0
>>> 11  0      0 3680708    348  14876    0    0     0     0 3869 4018  0 94  6  0
>>>  0  0      0 3773532    476  16876    0    0  2344     0 6598 9501  0 74 24  2
>>>  0  0      0 3773836    476  16912    0    0     0     0  191  129  0  0 100  0   <-- FS is not running any more: 100% idle CPU, lots of free RAM
>>>  0  0      0 3773856    476  17108    0    0     0     0   91   63  0  0 100  0
>>> 
>>> 
>>> 
>>> dmesg of the crash is this:
>>> 
>>> [48674.576126] Out of memory: Kill process 31552 (freeswitch) score 674 or sacrifice child
>>> [48674.578188] Killed process 31552 (freeswitch) total-vm:10177480kB, anon-rss:2710092kB, file-rss:0kB
>>> 
>>> fs_log switches to the dark side (there was no disk space problem, of
>>> course; I have no idea at which point in the downward spiral this popped up
>>> on the console):
>>> 
>>> 2013-03-06 23:27:22.923640 [ERR] switch_core_sqldb.c:579 NATIVE SQL ERR [database or disk is full]
>>> BEGIN EXCLUSIVE
>>> 2013-03-06 23:27:22.923640 [CRIT] switch_core_sqldb.c:1679 ERROR [database or disk is full]
>>> 2013-03-06 23:27:22.923640 [ERR] switch_core_sqldb.c:579 NATIVE SQL ERR [cannot commit - no transaction is active]
>>> COMMIT
>>> 2013-03-06 23:27:22.963609 [CRIT] switch_core_session.c:1674 Thread Failure!
>>> 2013-03-06 23:27:22.963609 [CRIT] switch_core_session.c:1634 LUKE: I'm hit, but not bad.
>>> 2013-03-06 23:27:22.963609 [CRIT] switch_core_session.c:1635 LUKE'S VOICE: Artoo, see what you can do with it. Hang on back there....
>>> Green laserfire moves past the beeping little robot as his head turns. After a few beeps and a twist of his mechanical arm,
>>> Artoo reduces the max sessions to 13897 thus, saving the switch from certain doom.
>>> 
>>> And that's all; FS is dead.
>>> 
>>> Running the test at 180 cps significantly hastens the death.
>>> It starts normally with ~40% idle CPU,
>>> then after circa 160k calls it starts to spiral:
>>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>>>  6  0      0 2901484  63204 161132    0    0     0     0 20586 33073 54  5 41  0
>>>  5  0      0 2901468  63212 160988    0    0     0    12 25359 41915 53  6 42  0
>>>  4  0      0 2901220  63212 161064    0    0     0     0 19650 31211 55  5 40  0
>>>  7  0      0 2901080  63212 161124    0    0     0     0 21148 33462 54  5 41  0
>>>  2  0      0 2900676  63212 161188    0    0     0     0 21767 35568 54  5 42  0
>>>  5  0      0 2900468  63212 161256    0    0     0     0 19398 30073 54  5 41  0
>>>  2  0      0 2900088  63220 161304    0    0     0    12 21315 34563 54  5 42  0
>>>  3  0      0 2900180  63220 161484    0    0     0     0 19858 31561 42  4 54  0       <-- this should not happen; the load is constant and idle CPU% should stay at ~40%
>>>  3  0      0 2899768  63220 161520    0    0     0     0 29907 46883 39  4 57  0
>>> 11  0      0 2900056  63220 161504    0    0     0     0 41730 63440 41  5 54  0
>>>  9  0      0 2899536  63220 161768    0    0     0     0 48876 68159 40  5 54  0
>>>  6  0      0 2899512  63228 161620    0    0     0    12 59988 80421 41  6 53  0
>>> 49  0      0 2876072  63228 162004    0    0     0     0 193384 241453 57 23 21  0      <-- here we go, starting the memory-hogging, CPU-eating spiral
>>> 40  0      0 2830720  63228 162080    0    0     0     0 308948 356178 51 49  0  0
>>> 156  0      0 2776576  63228 161984    0    0     0     0 301514 325415 48 48  5  0
>>> 178  0      0 2727044  63228 162260    0    0     0     0 290168 306158 44 51  5  0
>>> 70  0      0 2683252  63236 161924    0    0     0    48 291792 302632 37 58  5  0
>>>  5  0      0 2618432  63236 162204    0    0     0     0 316170 302886 35 59  7  0
>>>  2  0      0 2548092  63236 162364    0    0     0     0 340707 319105 38 56  5  0
>>>  2  0      0 2460528  63236 162464    0    0     0  1596 386953 348210 40 54  6  0
>>> 
>>> Nothing in the FS warn/err log at this point, until memory is completely full.
>>> 
>>> 
>>> Please give me some ideas on how to get some useful info for a bug report.
>>> 
>>> Kind Regards,
>>> Boris Ratner.
>>> 

-- 
Ken
http://www.FreeSWITCH.org
http://www.ClueCon.com
http://www.OSTAG.org
irc.freenode.net #freeswitch
