[Freeswitch-users] High Availability Cluster Module for FreeSWITCH

Wed Feb 13 03:12:41 MSK 2013

On Tue, Feb 12, 2013 at 3:18 PM, Eliot Gable <egable+freeswitch at gmail.com> wrote:

> On Tue, Feb 12, 2013 at 5:21 PM, Marcin Gozdalik <gozdal at gmail.com> wrote:
>
>> I did some, I got some anecdotes from people running them. It
>> certainly is possible to handle *some* failure cases, but there are
>> some *other* cases that will not be handled correctly.
>>
>>
> I have actually used Sonus and Genband, both of which seamlessly and
> gracefully handle failures. In fact, I cannot think of a single time where
> either of them failed to migrate calls over to an alternative system when
> the one handling calls had even one thing go wrong which prevented them
> from handling calls properly. And, as I mentioned earlier, I have also used
> P + C to build a multi-city, 6-node FS cluster (three iterations of it,
> actually) which could do essentially the same thing. While I don't have the
> resource agent I wrote for that cluster which handled the vast majority of
> failure conditions (including the two I mentioned earlier where you turn
> off ARP on the interface or set the speed/duplex wrong), I could fairly
> easily update my newly written resource agent to cover such scenarios (if
> someone wanted to pay for my time to do it).
>
>
>> Personally, I've observed an OpenSIPS instance that ran out of some
>> internal memory, and however you tried to monitor it, it would reply
>> that it was alive. It would even route your simple call scenarios well!
>> Unfortunately, the "real" calls were *usually* timing out, but not
>> always. Sorry, but I can't
>> imagine how you can automatically handle that, i.e. discover that this
>> node is "bad" and fail-over to some other. Even if you do that
>> correctly your monitoring tool can check that this node is operating
>> perfectly well (because after all the traffic is diverted from the
>> faulty node it begins to work well) and will want to move the traffic
>> back.
>>
>
> Just because you cannot imagine how it would work, it does not mean
> everyone else has the same limitations. You are consistently referring to a
> node (let's call it node A) telling one or more other nodes (B, C,
> whatever) about its state and ability to handle calls. You are taking the
> wrong approach entirely. You cannot have a failed node reliably report its
> state! What you can do is have Node A broadcast its state many times per
> second. Nodes B, C, etc, all listen for that state. They keep track of the
> state and when they *stop seeing* Node A, they know *something* failed.
> It's that simple.
>
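A minimal sketch of the "listen and notice the silence" idea described
above, in Python. It is not mod_ha_cluster code: the PeerTracker class, its
method names, and the 50 ms timeout are assumptions made for illustration,
since the post gives no heartbeat interval or timeout.

```python
import time


class PeerTracker:
    """Remembers when each peer's heartbeat was last seen (hypothetical helper)."""

    def __init__(self, timeout_s=0.05, clock=time.monotonic):
        # timeout_s is an assumed value; tune it to the real heartbeat rate.
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen = {}

    def heartbeat(self, peer):
        # Called by the listener each time a heartbeat from `peer` arrives.
        self.last_seen[peer] = self.clock()

    def silent_peers(self):
        # Peers whose most recent heartbeat is older than the timeout.
        now = self.clock()
        return sorted(
            peer for peer, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

A real listener would call heartbeat() from its receive loop and poll
silent_peers() on a timer. The tracker only says that *something* stopped;
deciding what failed is a separate step.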
> At this point, it is up to those nodes to determine *what* failed. They
> have to do things like check whether *they* failed, check whether the
> *network* failed, etc. There are very simple and sane ways they can do each
> of those things. Eventually, they can conclude that *Node A* failed. They
> can make this determination very quickly. Think microseconds or maybe 1-2
> milliseconds. At this point, nobody tells them what to do. They
> know already what they need to do. They have already pre-negotiated what
> will happen in the event of a failure, or they have very
> specific programmatic instructions on what to do, and so they act
> immediately.
>
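The order of checks in the paragraph above (did *I* fail, did the *network*
fail, otherwise the peer failed) followed by acting on a pre-negotiated plan
can be sketched roughly as below. The check results are passed in as plain
booleans, and the names Verdict, diagnose, react, and TAKEOVER_PLAN are
hypothetical; the post does not describe how mod_ha_cluster would implement
any of this.

```python
from enum import Enum


class Verdict(Enum):
    SELF_FAILED = "self_failed"
    NETWORK_FAILED = "network_failed"
    PEER_FAILED = "peer_failed"


def diagnose(self_ok, network_ok):
    """Decide what failed once a peer's heartbeats have stopped."""
    if not self_ok:
        return Verdict.SELF_FAILED
    if not network_ok:
        return Verdict.NETWORK_FAILED
    # We are healthy and the network is healthy, so the peer is the problem.
    return Verdict.PEER_FAILED


# Assumed pre-negotiated plan: failed node -> node that takes over for it.
TAKEOVER_PLAN = {"A": "B", "B": "C", "C": "A"}


def react(me, silent_peer, self_ok, network_ok):
    """Return the action this node takes, with no coordinator involved."""
    verdict = diagnose(self_ok, network_ok)
    if verdict is Verdict.SELF_FAILED:
        return "stand_down"
    if verdict is Verdict.PEER_FAILED and TAKEOVER_PLAN.get(silent_peer) == me:
        return "take_over"
    return "wait"
```

Because every node holds the same plan, each one can act on its own verdict
without asking anyone, which is the point of pre-negotiating the outcome.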
>
>> By "impossible" I mean handling all such gray areas. Certainly if
>> you power down the box or unplug Ethernet it is possible to migrate
>> live calls to some other box.
>>
>
> There are no such "gray" areas. That is just a fantasy you have.
> Everything in computing is black or white, true or false. If you don't
> know, you test and become certain. If you are prevented from accurately
> testing because (for example) you cannot see the node at all anymore or
> interact with it in any way, you assume the worst and nuke the box
> (STONITH).
>
> Let's say the FS box runs out of memory. Great! I designed mod_ha_cluster
> to cause FS to segfault if it runs out of memory. Heartbeats stopped, other
> node takes over. No gray area. Wait. Did the IP get removed from the box?
> No? Don't know? STONITH. Did that fail? Seriously? You deployed the system
> wrong; don't blame me for your mistakes.
>
> Did the hard drive go away? Great! I have a test for that and a way to
> tell the other nodes I need to be STONITH'd if I cannot reboot myself.
>
> Did FS deadlock? Great! No more heartbeats. Other node takes over. STONITH.
>
> Did a module in FS take an event from the eventing thread and get stuck
> spinning, never to return? Great! No more heartbeats. Other node takes
> over. STONITH.
>
> Did a module in FS launch 128 threads, all of which want to use 100% CPU?
> Great! Untimely heartbeat delivery, other node takes over. STONITH.
>
> Did your dual-router network have the connection between the two routers
> go down leaving you with a split network? Great! If you have that secondary
> network I talked about, it's all properly detected and handled for you! If
> not, well, don't blame me for your failures.
>
> Did someone slap a firewall rule on the box and we suddenly cannot accept
> SIP messages? Great! One of the other nodes in the cluster will be sending
> us test SIP traffic on occasion and when we see it doesn't work anymore, we
> shut down and another node takes over.
>
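The "No? Don't know? STONITH." rule from the out-of-memory scenario above
comes down to a three-valued check: only a confirmed release of the IP avoids
fencing. A rough sketch, where needs_stonith and its ip_released argument are
hypothetical names and None stands for "could not find out":

```python
from typing import Optional


def needs_stonith(ip_released: Optional[bool]) -> bool:
    """Fence the failed node unless we are certain it released the IP.

    True  -> release confirmed, no fencing needed
    False -> the node still holds the IP, fence it
    None  -> unknown (node unreachable), assume the worst and fence it
    """
    return ip_released is not True
```

Treating "unknown" the same as "no" is what removes the gray area: the
takeover node never has to guess about the state of the failed one.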
>
>> I'd just like to have HA that works every time and everywhere and try
>>
>
> So does everyone else. That is why I want to write mod_ha_cluster. Because
> what is there right now is overly complex, difficult to configure and test,
> and does not and cannot catch all of the possible ways in which a FS system
> can fail.
>
>
>
+1 again. If you have reservations about what Eliot is doing then great -
offer a valid alternative or ask a valid question. However, let's not make
assumptions about what can and cannot be accomplished when the discussion
is barely 3 days old. Eliot has answered a lot of questions here and has
also done a good job of eliminating the FUD while demystifying what happens
in a mod_ha_cluster.

Be sure to join us tomorrow on the FreeSWITCH conference call
<http://wiki.freeswitch.org/wiki/FS_weekly_2013_02_13> and we'll
talk about how to move forward and who wants to be part of the
discussion.

-MC