[Freeswitch-users] High Availability Cluster Module for FreeSWITCH

Eliot Gable egable+freeswitch at gmail.com
Mon Feb 11 00:00:21 MSK 2013


On Sun, Feb 10, 2013 at 3:11 PM, Marcin Gozdalik <gozdal at gmail.com> wrote:

> Don't get me wrong, I'd love to fund a good HA module for FS, if for no
> other reason than that I could benefit from it.
> But having done a few installations of systems that were supposed to
> be "HA" and seen them fail when real problems came I know it ain't
> easy.
> Redundant networks are fine, but the following scenarios usually lead to
> both machines replying to ARPs for the virtual IP, and the whole HA setup
> falls apart:
>
> 1) FS stops responding (e.g. due to heavy swapping or disk full), yet
> kernel manages to reply to ARPs
> 2) the HA module fails (like in crashes) but FS manages to work
> 3) some firewall rule is activated that stops multicast traffic (all or
> some)
>
> STONITH based on separate technology (like USB-USB connection
> connected to some KVM-over-IP with control over power) is
> indispensable in such scenarios.
>

I have also seen each of these cases when dealing with HA setups for
FreeSWITCH. That is part of why I want to write one specifically for
FreeSWITCH. General purpose HA systems cannot catch and properly deal with
the sorts of things I see when using one to run FreeSWITCH in a high
availability configuration.

For #1, it is pretty easy to detect when FS stops responding for whatever
reason. A watchdog thread inside FS can shut it down in a lot of those
cases, as well as remove the IP from the system. Aside from that, future
fencing will be useful to bounce the entire machine if the watchdog thread
does not or cannot handle the situation. This is another advantage of doing
the module inside FS.
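
To make the watchdog idea concrete, here is a minimal sketch in Python. It
is not the module's code; the class name, the callback, and the timings are
all hypothetical. The assumption is that the main loop calls beat() on
every healthy iteration, and that on_stall is where a real module would
remove the virtual IP and shut FS down.

```python
import threading
import time


class Watchdog:
    """Calls on_stall once if no heartbeat arrives within `timeout` seconds."""

    def __init__(self, timeout, on_stall, poll_interval=0.05):
        self.timeout = timeout
        self.on_stall = on_stall          # hypothetical hook: drop the IP, stop FS
        self.poll_interval = poll_interval
        self.fired = False
        self._last_beat = time.monotonic()
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def beat(self):
        # Called by the monitored loop each time it completes a healthy pass.
        with self._lock:
            self._last_beat = time.monotonic()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        # Wake up every poll_interval until stop() is called.
        while not self._stop.wait(self.poll_interval):
            with self._lock:
                stalled_for = time.monotonic() - self._last_beat
            if stalled_for > self.timeout:
                self.fired = True
                self.on_stall(stalled_for)
                return
```

A watchdog like this only helps while its own thread still gets scheduled,
which is why fencing remains the fallback for the cases it cannot handle.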

For #2, a lot of testing is required to make sure that if the HA system
fails, it takes down FS with it, or to ensure that it cannot simply "not
work" while FS continues to work. It's very hard for a module inside FS to
fail in a way that FS keeps working yet the module doesn't. When using
Pacemaker and Corosync, it is easy for the HA system to fail in a way that
leaves FS running on a node yet the HA system thinks it is not there. This
is one of the advantages of doing a module inside FS.
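
One way to picture "the HA logic cannot fail while the host keeps running"
is the following Python sketch. It is only an illustration under stated
assumptions: run_coupled and shutdown_host are hypothetical names, and a
real FS module would be written in C against the FS module API.

```python
import threading


def run_coupled(worker, shutdown_host):
    """Run worker in a thread; if it raises, call shutdown_host(exc)
    so the host process does not silently outlive its HA logic."""

    def guarded():
        try:
            worker()
        except BaseException as exc:
            # Any failure of the HA worker is escalated to the host.
            shutdown_host(exc)

    thread = threading.Thread(target=guarded, daemon=True)
    thread.start()
    return thread
```

The point of the design is that there is no code path where the worker
dies and nothing else happens.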

For #3, I would like to eventually have the module scanning the firewall
rules for changes and enforcing a specific, pre-determined "known-good" set
of firewall rules. That is a ways off, but it is planned. Besides that,
this is yet one more reason why having the module in FS is the best option.
When running in FS, it is easy to determine if the traffic is being blocked
by a firewall rule (you simply do not receive the traffic). Again, a module
can more effectively catch and respond to this type of situation compared
to a general purpose solution like Pacemaker and Corosync.
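
The "known-good" comparison could look roughly like this Python sketch. The
rule strings follow iptables-save syntax, but the specific addresses and
ports are made up for illustration, and how the current rules are read from
the system is left out entirely. diff_rules is a hypothetical helper, not
part of any existing tool.

```python
def diff_rules(known_good, current):
    """Compare the current firewall rules against a known-good list.

    Both arguments are lists of rule strings. Blank lines and surrounding
    whitespace are ignored. Order matters to a firewall, so a pure
    reordering is reported separately.
    """
    good = [r.strip() for r in known_good if r.strip()]
    now = [r.strip() for r in current if r.strip()]
    missing = [r for r in good if r not in now]
    unexpected = [r for r in now if r not in good]
    order_changed = not missing and not unexpected and good != now
    return {
        "missing": missing,
        "unexpected": unexpected,
        "order_changed": order_changed,
    }


# Hypothetical rule sets: someone replaced the multicast ACCEPT with a DROP.
KNOWN_GOOD = [
    "-A INPUT -p udp -d 239.0.0.1 --dport 5000 -j ACCEPT",
    "-A INPUT -p udp --dport 5060 -j ACCEPT",
]
current = [
    "-A INPUT -p udp -d 239.0.0.1 --dport 5000 -j DROP",
    "-A INPUT -p udp --dport 5060 -j ACCEPT",
]
drift = diff_rules(KNOWN_GOOD, current)
```

Any non-empty "missing" or "unexpected" list, or a changed order, would be
the trigger for re-applying the known-good set.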

I do not disagree about STONITH being indispensable in certain cases.
However, most companies I have worked with who try to do HA right now using
Pacemaker and Corosync ignore STONITH entirely. STONITH is
another case where the biggest problem is simply getting people to follow
best practices. And, interestingly enough, most companies actually get by
just fine without deploying a STONITH solution. They simply accept that
there is a risk of hitting a case where they need it and will not have it,
and they just plan to have someone manually reset something if that case
arises. This is why STONITH is not one of my first priorities in
development of the module. It is on the list because it is important (and I
feel it should be used in any carrier-grade deployment), but most companies
are perfectly OK doing without it. This is especially true for companies
who want HA on their home-grown PBX systems for their 10-person
organization. They couldn't care less whether the reset happens manually or
automatically in 1 out of 1,000 failures. As long as those other 999
failures are automatic, they will be happy.