On Tue, Feb 12, 2013 at 5:21 PM, Marcin Gozdalik <span dir="ltr">&lt;<a href="mailto:gozdal@gmail.com" target="_blank">gozdal@gmail.com</a>&gt;</span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

I did some, I got some anecdotes from people running them. It<br>

certainly is possible to handle *some* failure cases, but there are<br>

some *other* cases that will be not handled correctly.<br>

<br></blockquote><div><br></div><div>I have actually used both Sonus and Genband, which seamlessly and gracefully handle failures. In fact, I cannot think of a single time when either of them failed to migrate calls over to an alternative system when the one handling calls had even one thing go wrong that prevented it from handling calls properly. And, as I mentioned earlier, I have also used P + C to build a multi-city, 6-node FS cluster (three iterations of it, actually) which could do essentially the same thing. While I no longer have the resource agent I wrote for that cluster, which handled the vast majority of failure conditions (including the two I mentioned earlier where you turn off ARP on the interface or set the speed/duplex wrong), I could fairly easily update my newly written resource agent to cover such scenarios (if someone wanted to pay for my time to do it). </div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Personally I&#39;ve observed an OpenSIPS that ran out of some internal memory<br>

and however you wanted to monitor it, it would reply that it was alive. It would<br>

even route your simple call scenarios well! Unfortunately the &quot;real&quot;<br>

calls were *usually* timing out, but not always. Sorry, but I can&#39;t<br>

imagine how you can automatically handle that, i.e. discover that this<br>

node is &quot;bad&quot; and fail-over to some other. Even if you do that<br>

correctly your monitoring tool can check that this node is operating<br>

perfectly well (because after all the traffic is diverted from the<br>

faulty node it begins to work well) and will want to move the traffic<br>

back.<br></blockquote><div><br></div><div>Just because you cannot imagine how it would work, it does not mean everyone else has the same limitations. You are consistently referring to a node (let&#39;s call it node A) telling one or more other nodes (B, C, whatever) about its state and ability to handle calls. You are taking the wrong approach entirely. You cannot have a failed node reliably report its state! What you can do is have Node A broadcast its state many times per second. Nodes B, C, etc. all listen for that state. They keep track of the state and when they *stop seeing* Node A, they know *something* failed. It&#39;s that simple. </div>

<div><br></div><div>At this point, it is up to those nodes to determine *what* failed. They have to do things like check whether *they* failed, check whether the *network* failed, etc. There are very simple and sane ways they can do each of those things. Eventually, they can conclude that *Node A* failed. They can make this determination very quickly. Think microseconds, or maybe 1-2 milliseconds. At this point, nobody tells them what to do. They already know what they need to do. They have already pre-negotiated what will happen in the event of a failure, or they have very specific programmatic instructions on what to do, and so they act immediately. </div>
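<div><br></div><div>To make the listening side concrete, here is a minimal Python sketch of the idea, not mod_ha_cluster code. The names PeerMonitor and diagnose, the timeout value, and the two boolean self-check inputs are all assumptions made up for illustration. Each node records when it last saw a heartbeat from every peer, flags peers that have gone silent, and rules out its own failure and a network failure before blaming the peer.</div>

```python
import time


class PeerMonitor:
    """Hypothetical helper: remembers when each peer was last heard from."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s  # assumed silence threshold, in seconds
        self.clock = clock          # injectable clock so the logic is testable
        self.last_seen = {}

    def on_heartbeat(self, peer):
        # Called for every heartbeat received from a peer.
        self.last_seen[peer] = self.clock()

    def silent_peers(self):
        # Peers whose last heartbeat is older than the timeout.
        now = self.clock()
        return sorted(
            peer for peer, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


def diagnose(self_ok, network_ok):
    """Rule out local and network faults before concluding the peer failed."""
    if not self_ok:
        return "self_failed"
    if not network_ok:
        return "network_failed"
    return "peer_failed"
```

<div>A listener would call on_heartbeat for every packet it receives and poll silent_peers on a short timer. How self_ok and network_ok are computed is left open here, since the right checks depend on the deployment.</div>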

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

By &quot;impossible&quot; I mean handling all such gray areas. Certainly if<br>

you power down the box or unplug Ethernet, it is possible to migrate<br>

live calls to some other box.<br></blockquote><div><br></div><div>There are no such &quot;gray&quot; areas. That is just a fantasy you have. Everything in computing is black or white, true or false. If you don&#39;t know, you test and become certain. If you are prevented from accurately testing because (for example) you cannot see the node at all anymore or interact with it in any way, you assume the worst and nuke the box (STONITH). </div>
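<div><br></div><div>The "test, and if you cannot test, assume the worst" rule can be sketched in a few lines of Python. This is only an illustration: the Verdict enum, the action strings, and the choice to treat OSError (which includes timeouts) as "could not test" are assumptions, not part of any existing resource agent.</div>

```python
from enum import Enum


class Verdict(Enum):
    HEALTHY = "healthy"
    FAILED = "failed"
    UNKNOWN = "unknown"  # the test itself could not be carried out


def probe(check):
    """Run a health check; being unable to test counts as UNKNOWN."""
    try:
        return Verdict.HEALTHY if check() else Verdict.FAILED
    except OSError:
        # Unreachable host, timeout, and similar: we could not test at all.
        return Verdict.UNKNOWN


def decide(verdict):
    """Map a verdict to an action (hypothetical action names)."""
    if verdict is Verdict.HEALTHY:
        return "none"
    if verdict is Verdict.FAILED:
        return "take_over"
    # Cannot see or interact with the node: fence it first, then take over.
    return "stonith_then_take_over"
```

<div>The point of the three-way verdict is that "unknown" never lingers as a state. It is resolved immediately by fencing, so every outcome ends in a definite action.</div>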

<div><br></div><div>Let&#39;s say the FS box runs out of memory. Great! I designed mod_ha_cluster to cause FS to segfault if it runs out of memory. Heartbeats stopped, other node takes over. No gray area. Wait. Did the IP get removed from the box? No? Don&#39;t know? STONITH. Did that fail? Seriously? You deployed the system wrong; don&#39;t blame me for your mistakes.</div>

<div><br></div><div>Did the hard drive go away? Great! I have a test for that and a way to tell the other nodes I need to be STONITH&#39;d if I cannot reboot myself.</div><div><br></div><div>Did FS deadlock? Great! No more heartbeats. Other node takes over. STONITH.</div>

<div><br></div><div>Did a module in FS take an event from the eventing thread and get stuck spinning, never to return? Great! No more heartbeats. Other node takes over. STONITH.</div><div><br></div><div>Did a module in FS launch 128 threads, all of which want to use 100% CPU? Great! Untimely heartbeat delivery, other node takes over. STONITH.</div>

<div><br></div><div>Did your dual-router network have the connection between the two routers go down leaving you with a split network? Great! If you have that secondary network I talked about, it&#39;s all properly detected and handled for you! If not, well, don&#39;t blame me for your failures.</div>

<div><br></div><div>Did someone slap a firewall rule on the box and we suddenly cannot accept SIP messages? Great! One of the other nodes in the cluster will be sending us test SIP traffic on occasion and when we see it doesn&#39;t work anymore, we shut down and another node takes over.</div>
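<div><br></div><div>As a rough Python sketch of what such test SIP traffic could look like, the helper below builds a minimal SIP OPTIONS request and judges a reply by its status line. The function names, the "probe" user, and the choice of OPTIONS as the test method are assumptions for illustration only. The From and To headers use the bare URI form without a display name.</div>

```python
import uuid


def build_options(target, local, call_id=None):
    """Build a minimal SIP OPTIONS request as bytes (illustrative only)."""
    call_id = call_id or uuid.uuid4().hex
    lines = [
        f"OPTIONS sip:{target} SIP/2.0",
        f"Via: SIP/2.0/UDP {local};branch=z9hG4bK{call_id[:16]}",
        "Max-Forwards: 70",
        f"From: sip:probe@{local};tag={call_id[:8]}",
        f"To: sip:{target}",
        f"Call-ID: {call_id}@{local}",
        "CSeq: 1 OPTIONS",
        "Content-Length: 0",
        "",
        "",
    ]
    # Headers end with CRLF and the message ends with an empty line.
    return "\r\n".join(lines).encode("ascii")


def probe_ok(response):
    """True only when a reply arrived and its status code is 2xx."""
    if not response:
        return False
    parts = response.split(b"\r\n", 1)[0].split()
    return len(parts) >= 2 and parts[0] == b"SIP/2.0" and parts[1][:1] == b"2"
```

<div>A peer would send the request over UDP with a short receive timeout and pass whatever came back (or None on timeout) to probe_ok. How many consecutive failures should trigger a shutdown is a policy decision not covered here.</div>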

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

I&#39;d just like to have HA that works every time and everywhere and try<br></blockquote><div> </div><div>So would everyone else. That is why I want to write mod_ha_cluster. Because what is there right now is overly complex, difficult to configure and test, and does not and cannot catch all of the possible ways in which an FS system can fail.</div>

<div><br></div><div><br></div></div>