[Freeswitch-users] High Availability Cluster Module for FreeSWITCH

Marcin Gozdalik gozdal at gmail.com
Tue Feb 12 17:21:23 MSK 2013

2013/2/11 Eliot Gable <egable+freeswitch at gmail.com>:
> On Mon, Feb 11, 2013 at 7:36 AM, Marcin Gozdalik <gozdal at gmail.com> wrote:
>> +1
>> I do not doubt mod_ha is necessary inside of FS and it may be
>> better/simpler than writing Pacemaker resource agent, but writing
>> yet-another-cluster-communication-engine is IMHO the wrong way to go
>> and using Corosync for communication will give a lot of value from
>> mature codebase.
> I understand what you are saying, but what I am trying to get across is that
> I am not writing yet-another-cluster-communication-engine. All I am really
> doing is combining a multicast messaging API written by Tony and the event
> API in FS to broadcast existing state information between multiple FS nodes,
> as well as adding a tiny amount of logic on top of that to coordinate call
> fail over and recovery. That's probably a little over-simplified, but it
> gets the point across. The network communication code is already in FS and
> well tested. The event system is already in FS and well tested.

I also think I understand what you are saying. It means we have
trouble putting thought into writing ;)
From what I understand, what you are trying to achieve is that every
node in the FS "cluster" knows which nodes exist and whether they are
down or up.
What I am saying is that this simple task is fundamentally hard.
Sending and receiving multicast is easy, but keeping distributed state
consistent between nodes in a cluster is hard (as in really hard,
harder than writing a VoIP softswitch all over again), especially in
case of Byzantine failures (i.e. nodes lying that they are down when
they are up, or the other way round). I am no big expert in the area,
but I have seen at least 2 cases (MMM -
http://www.xaprb.com/blog/2011/05/04/whats-wrong-with-mmm/ and Chubby
in Google - http://www.read.seas.harvard.edu/~kohler/class/08w-dsi/chandra07paxos.pdf)
where people were trying to write (MMM) or use (Chubby) some kind of
distributed code and failed.
That's why, whenever I see anything related to distributed state, I say
that it's way beyond my understanding and that it is best to use something
that works.
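
To make the difficulty concrete, here is a minimal sketch of a
timeout-based failure detector. It is not mod_ha code: the Node class,
the 3-second timeout and the one-way packet loss are all assumptions
made up for illustration. It shows that even without any "lying" nodes,
two nodes can end up with contradictory views of who is up.

```python
# Hypothetical sketch of heartbeat-based failure detection (not mod_ha code).
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    timeout: float = 3.0  # assumed: seconds of silence before a peer is "down"
    last_seen: dict = field(default_factory=dict)  # peer name -> last heartbeat time

    def on_heartbeat(self, peer: str, now: float) -> None:
        """Record that a heartbeat from `peer` arrived at time `now`."""
        self.last_seen[peer] = now

    def view(self, now: float) -> dict:
        """Return this node's purely local opinion of each known peer."""
        return {
            peer: "up" if now - seen <= self.timeout else "down"
            for peer, seen in self.last_seen.items()
        }


def simulate() -> dict:
    a, b = Node("A"), Node("B")
    for t in range(10):
        # Heartbeats from B to A always arrive.
        a.on_heartbeat("B", float(t))
        # Heartbeats from A to B are lost from t=2 onward (one-way partition).
        if t < 2:
            b.on_heartbeat("A", float(t))
    now = 9.0
    return {"A": a.view(now), "B": b.view(now)}


if __name__ == "__main__":
    print(simulate())
```

In this run A still believes B is up, while B believes A is down, so B
would take over A's calls while A keeps handling them. Avoiding that
kind of disagreement is the part that needs quorum, fencing or a
membership protocol, and that is the part I would rather not rewrite.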

> I have
> already written the code to the point that it parses the configuration files
> and starts sending heartbeats out all of the interfaces configured. I have
> also already written a lot of the code that deals with the state
> transitions. All I am talking about doing is implementing a tiny little
> finite state machine. It's a pretty trivial programming task. In fact, I
> think it was covered in my first year at Carnegie Mellon University. Of
> course, I had already figured out how to write such things in high school, I
> just did not know what it was called at that point. My point is, that this
> is not yet-another-cluster-communication-engine. It is a very specific and
> small finite state machine designed solely with the goal in mind of making
> FS have just enough information to coordinate call fail over internally. If
> I recall correctly, a lot of people also said writing
> yet-another-VoIP-server was a waste of time, but now we have FreeSWITCH, and
> it was obviously worth the effort. And I am not even trying to do something
> as complex as that. If you think this is
> yet-another-cluster-communication-engine, you are missing the point. It is
> not. It never will be.

See above - if it will never be one and you are still trying to achieve
distributed, consistent state between nodes, IMHO you are going to get
it wrong. Frankly, I lack the knowledge and time to check whether the
Corosync API is perfect for this task. As Anthony suggested elsewhere,
maybe it is possible to abstract the communication/distributed state
part so that it would be easy to provide Corosync or other
(OpenReplica? Zookeeper?) implementations.

> Look at Sonus, Genband, Broadsoft, Veraz, etc. All the big-name
> carrier-grade telecom providers have a built-in solution for automatic call
> fail over. The only way FreeSWITCH will ever compete with such solutions is
> if it also has that feature. Pacemaker and Corosync are overkill just to get
> FS to handle single node failures and provide call recovery. It took me a
> full 3 months of working with them every day to really understand how to
> deploy them properly in conjunction with FreeSWITCH and Postgres to provide
> a carrier-grade hot-standby solution which was robust enough to handle 99%
> of the failures I could throw at it. Granted, this was back when the
> configuration still needed to be written by hand in XML and prior the
> existence of any resource agent for FreeSWITCH. But, even with those
> changes, deploying Pacemaker and Corosync is not a simple task. If that is
> the requirement for FS to have HA, it will never truly stand a chance
> against commercial offerings.

I believe that "Clusters from Scratch" allowed me to set up a working
Pacemaker/Corosync installation on Debian in less than a day.
It is a fair point that FS, to compete with the big names (at least in
the marketing buzz-feature checklist), has to have HA. The trouble is
that HA is always hard. Making it both simple and working is the ultimate
goal, but I'd rather shoot for "work" first and "simple" later.
Comparing FS to commercial offerings is hard, as the way FS is deployed
is usually different from that of its commercial competitors. If you buy
Broadsoft you don't get a Debian package or sources to compile - you get
a bunch of highly-paid consultants that install and configure everything
for you. If HA in FS worked that way, it would maybe be even better for
building a business around FS ;) Jokes aside - if somebody wants to
configure FS in an HA environment, I don't think it is much of an obstacle
for him to configure Corosync, as he already has to have some kind of
DB failover as well as redundant switches, power, etc. If some
commercial vendor is able to manufacture an out-of-the-box solution
based on FS with an "HA" checkbox somewhere in the Web configuration -
well, congratulations to him, and I hope he would be deservedly
reaping profits from his product.

Marcin Gozdalik
