[Freeswitch-users] High Availability Cluster Module for FreeSWITCH

Eliot Gable egable+freeswitch at gmail.com
Wed Feb 13 02:18:49 MSK 2013


On Tue, Feb 12, 2013 at 5:21 PM, Marcin Gozdalik <gozdal at gmail.com> wrote:

> I did some; I got some anecdotes from people running them. It
> certainly is possible to handle *some* failure cases, but there are
> some *other* cases that will not be handled correctly.
>
>
I have actually used Sonus and Genband gear, both of which seamlessly
and gracefully handle failures. In fact, I cannot think of a single time
when either of them failed to migrate calls over to an alternative
system after the node handling calls had even one thing go wrong that
prevented it from handling calls properly. And, as I mentioned earlier,
I have also used P + C to build a multi-city, 6-node FS cluster (three
iterations of it, actually) which could do essentially the same thing.
While I no longer have the resource agent I wrote for that cluster,
which handled the vast majority of failure conditions (including the two
I mentioned earlier where you turn off ARP on the interface or set the
speed/duplex wrong), I could fairly easily update my newly written
resource agent to cover such scenarios (if someone wanted to pay for my
time to do it).


> Personally, I've observed OpenSIPS run out of some internal memory,
> and however you wanted to monitor it, it would reply that it was
> alive. It would even route your simple call scenarios well!
> Unfortunately the "real" calls were *usually* timing out, though not
> always. Sorry, but I can't imagine how you can automatically handle
> that, i.e. discover that this node is "bad" and fail over to some
> other. Even if you do that correctly, your monitoring tool may find
> that this node is operating perfectly well (because after the traffic
> is diverted from the faulty node it begins to work well) and will
> want to move the traffic back.
>

Just because you cannot imagine how it would work does not mean
everyone else has the same limitation. You are consistently referring
to a node (let's call it Node A) telling one or more other nodes (B, C,
whatever) about its state and ability to handle calls. That is entirely
the wrong approach. You cannot have a failed node reliably report its
state! What you can do is have Node A broadcast its state many times
per second. Nodes B, C, etc. all listen for that state. They keep track
of the state, and when they *stop seeing* Node A, they know *something*
failed. It's that simple.
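Here is a rough sketch of the listening side, assuming plain UDP
multicast on Linux. The group address, port, and timings below are
invented for illustration; they are not mod_ha_cluster's actual wire
protocol:

    /* Heartbeat listener sketch: join a multicast group and treat a
       receive timeout as "we stopped seeing Node A". */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <sys/time.h>

    int main(void) {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in addr = {0};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(4569);              /* invented port */
        bind(fd, (struct sockaddr *)&addr, sizeof addr);

        struct ip_mreq mreq;                      /* invented group */
        inet_pton(AF_INET, "239.255.0.1", &mreq.imr_multiaddr);
        mreq.imr_interface.s_addr = htonl(INADDR_ANY);
        setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof mreq);

        /* Node A beats ~10x/sec; three missed beats = presumed failure. */
        struct timeval tv = { .tv_sec = 0, .tv_usec = 300 * 1000 };
        setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv);

        char buf[128];
        for (;;) {
            ssize_t n = recvfrom(fd, buf, sizeof buf, 0, NULL, NULL);
            if (n < 0)
                /* We *stopped seeing* Node A. Something failed; now
                   determine what (see below). */
                puts("heartbeat lost -- begin failure diagnosis");
            /* else: a beat arrived; Node A is still announcing itself */
        }
    }

The sending side is just the mirror image: a tight loop that sendto()s
a tiny state datagram to the same group every 100 ms.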

At this point, it is up to those nodes to determine *what* failed. They
have to do things like check whether *they* failed, check whether the
*network* failed, etc. There are very simple and sane ways they can do
each of those things. Eventually, they can conclude that *Node A*
failed. They can make this determination very quickly; think
microseconds, or maybe 1-2 milliseconds. At that point, nobody tells
them what to do. They already know what they need to do. They have
already pre-negotiated what will happen in the event of a failure, or
they have very specific programmatic instructions on what to do, and so
they act immediately.
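In code form, the triage could look something like the following; every
helper name here is a hypothetical stand-in for a real check, not
mod_ha_cluster's API:

    #include <stdio.h>

    typedef enum { SELF_FAILED, NETWORK_FAILED, PEER_FAILED } fault_t;

    /* Stubbed checks; real ones would probe the local NIC/watchdog,
       the other peers, and the secondary network respectively. */
    static int local_checks_ok(void)       { return 1; }
    static int can_reach_other_peers(void) { return 1; }

    static fault_t diagnose(void) {
        if (!local_checks_ok())
            return SELF_FAILED;    /* *we* failed: get out of the way */
        if (!can_reach_other_peers())
            return NETWORK_FAILED; /* the *network* failed: sit tight */
        return PEER_FAILED;        /* everyone else is visible: Node A */
    }

    /* The reaction is pre-negotiated, so no coordination round trip
       is needed; each survivor already knows what it claims. */
    static void on_heartbeat_loss(void) {
        switch (diagnose()) {
        case SELF_FAILED:    puts("shut self down");                   break;
        case NETWORK_FAILED: puts("hold; avoid split-brain");          break;
        case PEER_FAILED:    puts("take over Node A; STONITH Node A"); break;
        }
    }

    int main(void) { on_heartbeat_loss(); return 0; }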


> By "impossible" I mean handling all such gray areas. Certainly, if
> you power down the box or unplug the Ethernet, it is possible to
> migrate live calls to some other box.
>

There are no such "gray" areas. That is just a fantasy you have.
Everything in computing is black or white, true or false. If you don't
know, you test and become certain. If you are prevented from accurately
testing because (for example) you cannot see the node at all anymore or
interact with it in any way, you assume the worst and nuke the box
(STONITH: Shoot The Other Node In The Head).

Let's say the FS box runs out of memory. Great! I designed
mod_ha_cluster to cause FS to segfault if it runs out of memory.
Heartbeats stop, another node takes over. No gray area. Wait. Did the
IP get removed from the box? No? Don't know? STONITH. Did that fail?
Seriously? You deployed the system wrong; don't blame me for your
mistakes.
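Back to the out-of-memory case: the fail-fast idea is trivial to
express. Here's the shape of it, with a wrapper name I made up (this is
not FreeSWITCH's allocator):

    #include <stdio.h>
    #include <stdlib.h>

    static void *xmalloc(size_t n) {
        void *p = malloc(n);
        if (!p) {
            /* Out of memory: don't limp along half-alive answering
               health checks. Crash, the heartbeat stops, a healthy
               node takes over. */
            fprintf(stderr, "OOM: aborting so peers take over\n");
            abort();
        }
        return p;
    }

    int main(void) {
        char *buf = xmalloc(1024);  /* succeeds or kills the process */
        free(buf);
        return 0;
    }

Dying loudly is the point: a dead process sends no heartbeats, which is
an unambiguous signal to the survivors.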

Did the hard drive go away? Great! I have a test for that and a way to tell
the other nodes I need to be STONITH'd if I cannot reboot myself.

Did FS deadlock? Great! No more heartbeats. Other node takes over. STONITH.

Did a module in FS take an event from the eventing thread and get stuck
spinning, never to return? Great! No more heartbeats. Other node takes
over. STONITH.

Did a module in FS launch 128 threads, all of which want to use 100% CPU?
Great! Untimely heartbeat delivery, other node takes over. STONITH.

Did your dual-router network have the connection between the two
routers go down, leaving you with a split network? Great! If you have
that secondary network I talked about, it's all properly detected and
handled for you! If not, well, don't blame me for your failures.

Did someone slap a firewall rule on the box so we suddenly cannot
accept SIP messages? Great! One of the other nodes in the cluster will
be sending us test SIP traffic on occasion, and when we see it no
longer works, we shut down and another node takes over.
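Such a probe can be as simple as a hand-rolled SIP OPTIONS ping. A
rough sketch follows; the addresses and header values are invented
(RFC 5737 example addresses), and a real probe would of course use a
proper SIP stack and full response parsing:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/time.h>

    int main(void) {
        const char *probe =
            "OPTIONS sip:probe@192.0.2.10 SIP/2.0\r\n"
            "Via: SIP/2.0/UDP 192.0.2.11:5060;branch=z9hG4bKprobe1\r\n"
            "From: <sip:ha@192.0.2.11>;tag=probe1\r\n"
            "To: <sip:probe@192.0.2.10>\r\n"
            "Call-ID: probe1@192.0.2.11\r\n"
            "CSeq: 1 OPTIONS\r\n"
            "Max-Forwards: 70\r\n"
            "Content-Length: 0\r\n\r\n";

        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in dst = {0};
        dst.sin_family = AF_INET;
        dst.sin_port = htons(5060);
        inet_pton(AF_INET, "192.0.2.10", &dst.sin_addr);
        sendto(fd, probe, strlen(probe), 0,
               (struct sockaddr *)&dst, sizeof dst);

        /* No 200 OK within the timeout = node cannot accept SIP. */
        struct timeval tv = { .tv_sec = 2, .tv_usec = 0 };
        setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv);

        char buf[1024];
        ssize_t n = recvfrom(fd, buf, sizeof buf, 0, NULL, NULL);
        if (n >= 11 && !strncmp(buf, "SIP/2.0 200", 11))
            puts("SIP path OK");
        else
            puts("no/bad reply: fail this node over");
        return 0;
    }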


> I'd just like to have HA that works every time and everywhere and try
>

So does everyone else. That is why I want to write mod_ha_cluster: what
is there right now is overly complex, difficult to configure and test,
and does not and cannot catch all of the possible ways in which an FS
system can fail.