[Freeswitch-users] High Availability Cluster Module for FreeSWITCH

Wed Feb 13 14:16:16 MSK 2013

2013/2/13 Eliot Gable <egable+freeswitch at gmail.com>:
> On Tue, Feb 12, 2013 at 5:21 PM, Marcin Gozdalik <gozdal at gmail.com> wrote:

> What you can do is have Node A broadcast its state many times per second.
> Nodes B, C, etc, all listen for that state. They keep track of the state and
> when they *stop seeing* Node A, they know *something* failed. It's that
> simple.
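
The listening side of the scheme quoted above can be sketched roughly as follows. This is only an illustration, not how mod_ha_cluster works: the class name `HeartbeatMonitor`, the 0.5 s timeout and the injectable clock are all assumptions made for the example.

```python
import time


class HeartbeatMonitor:
    """Tracks when each peer was last heard from (hypothetical helper)."""

    def __init__(self, timeout_s=0.5, clock=time.monotonic):
        # timeout_s is an assumed value; a real deployment would tune it.
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen = {}

    def record(self, node):
        """Call this whenever a heartbeat from `node` arrives."""
        self.last_seen[node] = self.clock()

    def suspected(self, node):
        """True if `node` was never heard from or has been silent too long."""
        seen = self.last_seen.get(node)
        if seen is None:
            return True
        return self.clock() - seen > self.timeout_s
```

Note that `suspected()` only says "something failed" from this one listener's point of view; it cannot tell a dead node from a broken network path.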

I'm not arguing with that, I'm arguing something slightly different:
 - sometimes B and C can't agree whether A is up or not (B sees A, C
does not see A)
 - B and C have to agree on who will take over from A
 - once B and C agree on who takes over from A (say B), B might go
down *during* the takeover, and C must be able to recognize this
situation and take over from both A and B
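
To make the three points above concrete, here is a minimal sketch of a majority vote plus a deterministic successor choice. The function names and the "lowest name wins" rule are assumptions for illustration only, and the sketch assumes every node already sees the same set of votes, which is exactly the hard part in practice.

```python
def majority_says_down(views):
    """views maps observer name -> True if that observer still sees the target.

    Returns True only if a strict majority of observers lost sight of it.
    """
    down_votes = sum(1 for sees in views.values() if not sees)
    return down_votes > len(views) // 2


def pick_successor(candidates, failed):
    """Pick the lowest-named candidate that is not known to have failed.

    Returns None when no candidate is left.
    """
    alive = sorted(set(candidates) - set(failed))
    return alive[0] if alive else None


# B sees A, C does not: one vote each, so there is no majority either way.
split = majority_says_down({"B": True, "C": False})

# A is declared down and B is chosen; if B then dies mid-takeover,
# re-running the choice with B marked as failed hands the job to C.
first = pick_successor(["B", "C"], failed={"A"})
second = pick_successor(["B", "C"], failed={"A", "B"})
```

With only two observers a disagreement cannot be resolved by voting, which is one reason clusters usually want an odd number of voters or a tie-breaker.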

It is also a challenge to define what it actually means for a node to
be down. Most situations are clear enough, but some are tricky to
check.

All these are *solvable* problems; I just didn't see how they are
addressed, but probably I wasn't looking hard enough.

>> By "impossible" I mean handling all such gray areas. Certainly if
>> you'll power down the box or unplug Ethernet it is possible to migrate
>> live calls to some other box.
>
>
> There are no such "gray" areas. That is just a fantasy you have. Everything
> in computing is black or white, true or false. If you don't know, you test
> and become certain. If you are prevented from accurately testing because
> (for example) you cannot see the node at all anymore or interact with it in
> any way, you assume the worst and nuke the box (STONITH).

It's not so simple in computing - perfectly deterministic programs can
exhibit chaotic behavior. The "gray" area I'm writing about is that
tests are always somewhat inaccurate, i.e. all tests report OK, yet
the software being tested does not perform its functions for the end
user.

> Let's say the FS box runs out of memory. Great! I designed mod_ha_cluster to
[snip]
> shut down and another node takes over.

I agree with what you've written. I'm nitpicking that sometimes you
get a timely heartbeat and yet the service in question (be it FS or
any other software) does not function properly for one reason or
another. You add another test, e.g. to check whether it responds to
SIP, yet there are still real conditions under which SIP works and
real calls (or at least some of them) can't get through. Consider
configuration de-synchronization, a malfunctioning AAA subsystem, or
one node of the DB going down. They are not core FS elements, but IMHO
when speaking of competing with commercial offerings all of them have
to be taken into account.
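
The layering described above (heartbeat, then SIP, then an actual call path) could be sketched like this. The probe names are hypothetical and the probes here are stubs; real ones would send a SIP OPTIONS request, place a test call, query the AAA backend, and so on.

```python
def evaluate_health(checks):
    """Run ordered (name, probe) pairs; each probe returns True on success.

    Returns (healthy, failed_names). A probe that raises counts as failed.
    """
    failed = []
    for name, probe in checks:
        try:
            ok = bool(probe())
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return (len(failed) == 0, failed)


# Stub probes: the heartbeat and SIP checks pass, yet the end-to-end
# test call fails, e.g. because the AAA subsystem is misbehaving.
checks = [
    ("heartbeat", lambda: True),
    ("sip_options", lambda: True),
    ("test_call", lambda: False),
]
healthy, failed = evaluate_health(checks)
```

The node only counts as healthy when every layer passes, so the shallow checks alone never declare it OK. The set of probes is still finite, though, so some failure modes will always fall outside it.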

>> I'd just like to have HA that works every time and everywhere and try
>
>
> So does everyone else. That is why I want to write mod_ha_cluster. Because
> what is there right now is overly complex, difficult to configure and test,
> and does not and cannot catch all of the possible ways in which a FS system
> can fail.

Frankly, I'd love to move this conversation into a more constructive
area than pure speculation. What about some kind of blueprint that
describes how such a module could work, which scenarios were
considered, and how the module would deal with them?

--
Marcin Gozdalik