[Freeswitch-users] High Availability Cluster Module for FreeSWITCH

Steven Ayre steveayre at gmail.com
Wed Feb 13 13:47:43 MSK 2013


This is something where you'll need to get a physical rack and install your
own setup. It's too specialised for you to find anyone offering it as
standard.

You're probably going to want the boxes to be pretty close together and
connected via a dedicated LAN anyway - otherwise packet loss / routing
loops / latency / jitter will mean your heartbeats start failing.
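
To put rough numbers on that (purely illustrative - none of this comes from
any particular HA stack): with 100ms heartbeats and a "dead after 3 missed"
policy, a latency/jitter spike of a few hundred milliseconds on a shared WAN
is already enough to trigger a false failover, because the receiving side is
doing little more than this:

/* Illustrative receiver-side liveness check (not from any particular HA
 * stack): a peer is declared dead once nothing has been heard from it for
 * MISSES * INTERVAL plus a small jitter allowance.  On a shared WAN the
 * jitter alone can blow through that allowance and cause false failovers. */
#include <stdbool.h>
#include <stdint.h>

#define HB_INTERVAL_MS    100  /* assumed heartbeat period              */
#define HB_MISSES_ALLOWED 3    /* assumed "dead after N missed" policy  */
#define HB_JITTER_MS      50   /* slack for network + scheduling jitter */

/* now_ms and last_seen_ms are monotonic-clock timestamps in milliseconds. */
bool peer_is_dead(uint64_t now_ms, uint64_t last_seen_ms)
{
    uint64_t deadline = HB_MISSES_ALLOWED * HB_INTERVAL_MS + HB_JITTER_MS;
    return (now_ms - last_seen_ms) > deadline;  /* 350ms with these numbers */
}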

Similarly, you don't want anything stealing CPU away and preventing you
from sending heartbeats at the correct time, so VPS options like Amazon
AWS are not going to be appropriate either.

And you're going to want STONITH support too (e.g. via power sockets that
are controllable over IP) - it's unlikely a dedicated server from someone
like Rackspace will offer that, and shutting down virtual machine
instances isn't reliable enough for STONITH (the guest could be down
because of a fault on the host, which might leave you unable to shut down
the guest).
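
And once you do have an IP-controllable power strip, the fencing call itself
is tiny. A rough sketch - the PDU hostname, outlet URL and the idea of
driving it over HTTP with curl are assumptions for illustration; real PDUs
expose SNMP/SSH/vendor APIs, normally wrapped in a proper fence agent:

/* Rough STONITH sketch: power off a peer's outlet on an IP-controllable
 * PDU and only report success if the PDU confirms it.  The URL scheme
 * (pdu.example, outlet/3/off) is made up for illustration; real PDUs use
 * SNMP, SSH or vendor HTTP APIs, usually driven through a fence agent. */
#include <stdio.h>
#include <string.h>

/* Returns 0 only if the PDU answered HTTP 200; anything else is a failure. */
int fence_peer_via_pdu(const char *outlet_url)
{
    char cmd[512];
    char status[16] = "";

    /* curl: -s silent, -o discard body, -w print just the HTTP status code */
    snprintf(cmd, sizeof(cmd),
             "curl -s -o /dev/null -w '%%{http_code}' '%s'", outlet_url);

    FILE *p = popen(cmd, "r");
    if (!p)
        return -1;
    if (!fgets(status, sizeof(status), p)) {
        pclose(p);
        return -1;
    }
    pclose(p);

    return strncmp(status, "200", 3) == 0 ? 0 : -1;
}

/* e.g. fence_peer_via_pdu("http://pdu.example/outlet/3/off");
 * "I couldn't reach the PDU" must be treated as a failed fence -- only a
 * confirmed power-off makes it safe to take over the peer's resources. */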

-Steve



On 12 February 2013 23:59, Avi Marcus <avi at avimarcus.net> wrote:

> Eliot, for those of us not running our own network, which
> datacenters/colos offer multiple networks/NICs?
> e.g. Amazon, Rackspace, Linode, SoftLayer?
>
> It's not something I recall seeing mentioned...
>
> -Avi
>
> On Wed, Feb 13, 2013 at 1:18 AM, Eliot Gable <egable+freeswitch at gmail.com> wrote:
>
>> On Tue, Feb 12, 2013 at 5:21 PM, Marcin Gozdalik <gozdal at gmail.com> wrote:
>>
>>> I did some, and I got some anecdotes from people running them. It
>>> certainly is possible to handle *some* failure cases, but there are
>>> some *other* cases that will not be handled correctly.
>>>
>>>
>> I have actually used Sonus and Genband, both of which seamlessly and
>> gracefully handle failures. In fact, I cannot think of a single time
>> when either of them failed to migrate calls over to an alternative
>> system when the one handling calls had even one thing go wrong that
>> prevented it from handling calls properly. And, as I mentioned earlier,
>> I have also used P + C to build a multi-city, 6-node FS cluster (three
>> iterations of it, actually) which could do essentially the same thing.
>> While I don't have the resource agent I wrote for that cluster, which
>> handled the vast majority of failure conditions (including the two I
>> mentioned earlier, where you turn off ARP on the interface or set the
>> speed/duplex wrong), I could fairly easily update my newly written
>> resource agent to cover such scenarios (if someone wanted to pay for my
>> time to do it).
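
The two interface failure modes mentioned there (ARP turned off, wrong
speed/duplex) don't need anything exotic to detect. A Linux-only sketch of
the sort of check a resource agent could run - the interface name and the
expected speed/duplex values are assumptions for the example:

/* Illustrative Linux-only interface check of the sort a resource agent
 * might run: fail if ARP has been disabled on the interface (IFF_NOARP),
 * or if it negotiated something other than the expected speed/duplex.
 * "eth0", "1000" and "full" below are assumptions for the example. */
#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

static int read_sysfs(const char *iface, const char *attr, char *buf, size_t n)
{
    char path[128];
    snprintf(path, sizeof(path), "/sys/class/net/%s/%s", iface, attr);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)n, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

int iface_healthy(const char *iface, const char *want_speed,
                  const char *want_duplex)
{
    struct ifreq ifr;
    char speed[32], duplex[32];

    /* 1. ARP: if IFF_NOARP is set the box silently vanishes from the L2
     *    segment even though the interface is "up". */
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return 0;
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, iface, IFNAMSIZ - 1);
    int bad = (ioctl(fd, SIOCGIFFLAGS, &ifr) < 0) || (ifr.ifr_flags & IFF_NOARP);
    close(fd);
    if (bad)
        return 0;

    /* 2. Negotiated speed/duplex as exposed by sysfs. */
    if (read_sysfs(iface, "speed", speed, sizeof(speed)) < 0 ||
        read_sysfs(iface, "duplex", duplex, sizeof(duplex)) < 0)
        return 0;

    return strcmp(speed, want_speed) == 0 && strcmp(duplex, want_duplex) == 0;
}

/* e.g. if (!iface_healthy("eth0", "1000", "full")) -> report the resource
 * as failed so the cluster manager fails the service over. */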
>>
>>
>>> Personally, I've observed OpenSIPS run out of some internal memory, and
>>> however you monitored it, it would reply that it was alive. It would
>>> even route your simple call scenarios well! Unfortunately the "real"
>>> calls were usually (though not always) timing out. Sorry, but I can't
>>> imagine how you can automatically handle that, i.e. discover that this
>>> node is "bad" and fail over to some other node. Even if you do that
>>> correctly, your monitoring tool may then find that the node is operating
>>> perfectly well (because once the traffic is diverted away from the
>>> faulty node it begins to work well again) and will want to move the
>>> traffic back.
>>>
>>
>> Just because you cannot imagine how it would work does not mean
>> everyone else has the same limitations. You are consistently referring
>> to a node (let's call it Node A) telling one or more other nodes (B, C,
>> whatever) about its state and ability to handle calls. You are taking
>> the wrong approach entirely. You cannot have a failed node reliably
>> report its state! What you can do is have Node A broadcast its state
>> many times per second. Nodes B, C, etc., all listen for that state. They
>> keep track of the state, and when they *stop seeing* Node A, they know
>> *something* failed. It's that simple.
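
For concreteness, the broadcasting side of that can be as dumb as this - the
multicast group, port, interval and payload format are placeholder
assumptions, not the real module:

/* Minimal sketch of the "broadcast your state many times per second" side.
 * The multicast group, port, 100ms interval and payload format are all
 * placeholders for illustration, not what mod_ha_cluster actually uses. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define HB_GROUP       "239.255.42.42"  /* assumed group on the dedicated LAN */
#define HB_PORT        4242
#define HB_INTERVAL_US 100000           /* 10 heartbeats per second */

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in dst;
    unsigned long seq = 0;
    char msg[128];

    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(HB_PORT);
    inet_pton(AF_INET, HB_GROUP, &dst.sin_addr);

    for (;;) {
        /* The payload carries whatever state the peers need so that, at
         * failure time, nobody has to ask the dead node anything. */
        int len = snprintf(msg, sizeof(msg),
                           "node=A seq=%lu state=active calls=0", seq++);
        sendto(fd, msg, (size_t)len, 0, (struct sockaddr *)&dst, sizeof(dst));
        usleep(HB_INTERVAL_US);
    }
}

The listeners just record the source and arrival time of each of these; the
interesting part is what they do when the packets stop arriving.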
>>
>> At this point, it is up to those nodes to determine *what* failed. They
>> have to do things like check whether *they* failed, check whether the
>> *network* failed, etc. There are very simple and sane ways they can do
>> each of those things. Eventually, they can conclude that *Node A*
>> failed. They can make this determination very quickly - think
>> microseconds, or maybe 1-2 milliseconds. At this point, nobody tells
>> them what to do. They already know what they need to do. They have
>> already pre-negotiated what will happen in the event of a failure, or
>> they have very specific programmatic instructions on what to do, and so
>> they act immediately.
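
The deciding side then has roughly this shape. The helper functions and the
threshold below are made up purely to illustrate the order of the checks
("is it me?", "is it the network?", and only then "the peer is dead"), not a
real API:

/* Sketch of the deciding side: each node tracks when it last heard every
 * peer, and when a peer goes quiet it rules out "it's me" and "it's the
 * network" before concluding the peer is dead and executing whatever
 * takeover was pre-negotiated.  All helpers and thresholds are made up. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define HB_DEADLINE_MS 350  /* 3 missed 100ms heartbeats + jitter slack */

/* Assumed to exist elsewhere in the sketch: */
extern uint64_t now_ms(void);                  /* monotonic clock          */
extern uint64_t last_heartbeat_ms(int peer);   /* updated by the listener  */
extern bool     my_nics_are_healthy(void);     /* link/ARP/duplex checks   */
extern bool     peer_answers_on_secondary(int peer); /* 2nd network probe  */
extern void     fence_and_take_over(int peer); /* STONITH, grab IPs, calls */

void evaluate_peer(int peer)
{
    if (now_ms() - last_heartbeat_ms(peer) <= HB_DEADLINE_MS)
        return;                                /* still hearing it: fine  */

    /* Silence detected.  First suspect: us. */
    if (!my_nics_are_healthy()) {
        fprintf(stderr, "local network unhealthy; standing down\n");
        return;                                /* don't trust our own view */
    }

    /* Second suspect: the primary network.  If the peer still answers on
     * the secondary network, this is a split, not a dead node. */
    if (peer_answers_on_secondary(peer)) {
        fprintf(stderr, "peer %d alive via secondary net: network split\n",
                peer);
        return;                /* follow the pre-negotiated split policy */
    }

    /* Only now conclude the peer itself failed.  Nobody tells us what to
     * do at this point; the plan was agreed before the failure. */
    fence_and_take_over(peer);
}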
>>
>>
>>> By "impossible" I mean handling all such gray areas. Certainly if
>>> you'll power down the box or unplug Ethernet it is possible to migrate
>>> live calls to some other box.
>>>
>>
>> There are no such "gray" areas. That is just a fantasy you have.
>> Everything in computing is black or white, true or false. If you don't
>> know, you test and become certain. If you are prevented from accurately
>> testing because (for example) you cannot see the node at all anymore or
>> interact with it in any way, you assume the worst and nuke the box
>> (STONITH).
>>
>> Let's say the FS box runs out of memory. Great! I designed
>> mod_ha_cluster to cause FS to segfault if it runs out of memory.
>> Heartbeats stop, another node takes over. No gray area. Wait - did the
>> IP get removed from the box? No? Don't know? STONITH. Did that fail?
>> Seriously? You deployed the system wrong; don't blame me for your
>> mistakes.
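
That crash-on-out-of-memory behaviour is trivial to arrange - the point is
simply to die instantly and loudly rather than limp along half-alive. A
sketch (abort() shown here; any guaranteed immediate crash does the job):

/* Sketch of "die immediately on allocation failure" so the heartbeat
 * stops and the peers take over, instead of the process limping along
 * half-alive.  Not the actual mod_ha_cluster code. */
#include <stdio.h>
#include <stdlib.h>

void *xmalloc(size_t n)
{
    void *p = malloc(n);
    if (!p) {
        /* No heartbeat can be trusted from a process with no memory:
         * crash now and let the fast takeover handle it. */
        fprintf(stderr, "out of memory allocating %zu bytes, aborting\n", n);
        abort();
    }
    return p;
}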
>>
>> Did the hard drive go away? Great! I have a test for that and a way to
>> tell the other nodes I need to be STONITH'd if I cannot reboot myself.
>>
>> Did FS deadlock? Great! No more heartbeats. Other node takes over.
>> STONITH.
>>
>> Did a module in FS take an event from the eventing thread and get stuck
>> spinning, never to return? Great! No more heartbeats. Other node takes
>> over. STONITH.
>>
>> Did a module in FS launch 128 threads, all of which want to use 100% CPU?
>> Great! Untimely heartbeat delivery, other node takes over. STONITH.
>>
>> Did your dual-router network have the connection between the two routers
>> go down leaving you with a split network? Great! If you have that secondary
>> network I talked about, it's all properly detected and handled for you! If
>> not, well, don't blame me for your failures.
>>
>> Did someone slap a firewall rule on the box so that we suddenly cannot
>> accept SIP messages? Great! One of the other nodes in the cluster will
>> be sending us test SIP traffic on occasion, and when we see it doesn't
>> work anymore, we shut down and another node takes over.
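
That test SIP traffic is typically nothing more than an OPTIONS ping. A
bare-bones sketch of the probing side, with made-up addresses and no
retransmission handling:

/* Bare-bones SIP OPTIONS probe of the kind a peer node could send to check
 * that a node still accepts SIP.  Addresses, ports and identifiers are
 * made up; a real probe would handle retransmits and parse the response. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if target answered with any SIP response within 1 second. */
int sip_options_probe(const char *target_ip, int target_port)
{
    char req[512], resp[1024];
    struct sockaddr_in dst;
    struct timeval tv = { 1, 0 };               /* 1 second receive timeout */
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(target_port);
    inet_pton(AF_INET, target_ip, &dst.sin_addr);

    int len = snprintf(req, sizeof(req),
        "OPTIONS sip:probe@%s:%d SIP/2.0\r\n"
        "Via: SIP/2.0/UDP 192.0.2.11:5099;branch=z9hG4bK-probe1;rport\r\n"
        "Max-Forwards: 70\r\n"
        "From: <sip:watchdog@192.0.2.11>;tag=probe1\r\n"
        "To: <sip:probe@%s>\r\n"
        "Call-ID: probe1@192.0.2.11\r\n"
        "CSeq: 1 OPTIONS\r\n"
        "Contact: <sip:watchdog@192.0.2.11:5099>\r\n"
        "Content-Length: 0\r\n\r\n",
        target_ip, target_port, target_ip);

    sendto(fd, req, (size_t)len, 0, (struct sockaddr *)&dst, sizeof(dst));
    ssize_t n = recvfrom(fd, resp, sizeof(resp) - 1, 0, NULL, NULL);
    close(fd);

    return n > 8 && strncmp(resp, "SIP/2.0", 7) == 0;
}

/* A node that stops answering these probes (firewall rule, wedged stack,
 * whatever) gets treated exactly like one that stopped heartbeating. */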
>>
>>
>>> I'd just like to have HA that works every time and everywhere and try
>>>
>>
>> So does everyone else. That is why I want to write mod_ha_cluster.
>> Because what is there right now is overly complex, difficult to configure
>> and test, and does not and cannot catch all of the possible ways in which a
>> FS system can fail.
>>
>>
>>