On Tue, Feb 12, 2013 at 9:21 AM, Marcin Gozdalik <span dir="ltr">&lt;<a href="mailto:gozdal@gmail.com" target="_blank">gozdal@gmail.com</a>&gt;</span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


2013/2/11 Eliot Gable &lt;<a href="mailto:egable%2Bfreeswitch@gmail.com" target="_blank">egable+freeswitch@gmail.com</a>&gt;:<br>

<div>&gt; On Mon, Feb 11, 2013 at 7:36 AM, Marcin Gozdalik &lt;<a href="mailto:gozdal@gmail.com" target="_blank">gozdal@gmail.com</a>&gt; wrote:<br>

&gt;&gt;<br>

&gt;&gt; +1<br>

&gt;&gt;<br>

&gt;&gt; I do not doubt mod_ha is necessary inside of FS  and it may be<br>

&gt;&gt; better/simpler than writing Pacemaker resource agent, but writing<br>

&gt;&gt; yet-another-cluster-communication-engine is IMHO the wrong way to go<br>

&gt;&gt; and using Corosync for communication will give a lot of value from<br>

&gt;&gt; mature codebase.<br>

&gt;&gt;<br>

&gt;<br>

&gt; I understand what you are saying, but what I am trying to get across is that<br>

&gt; I am not writing yet-another-cluster-communication-engine. All I am really<br>

&gt; doing is combining a multicast messaging API written by Tony and the event<br>

&gt; API in FS to broadcast existing state information between multiple FS nodes,<br>

&gt; as well as adding a tiny amount of logic on top of that to coordinate call<br>

&gt; fail over and recovery. That&#39;s probably a little over-simplified, but it<br>

&gt; gets the point across. The network communication code is already in FS and<br>

&gt; well tested. The event system is already in FS and well tested.<br>

<br>

</div>I also think I understand what you are saying. It means we have<br>

trouble putting thought into writing ;)<br>

From what I understand, what you are trying to achieve is that every<br>

node in the FS &quot;cluster&quot; knows which nodes exist and whether they are<br>

down or up.<br>

What I am saying is that this simple task is fundamentally hard.<br>

Sending and receiving multicast is easy, but keeping distributed state<br>

consistent between nodes in a cluster is hard (as in really hard,<br>

harder than writing a VoIP softswitch all over again), especially in<br>

case of Byzantine failures (i.e. nodes lying that they are down when<br>

they are up or the other way round). I am no big expert in the area but<br>

I have seen at least 2 cases (MMM -<br>

<a href="http://www.xaprb.com/blog/2011/05/04/whats-wrong-with-mmm/" target="_blank">http://www.xaprb.com/blog/2011/05/04/whats-wrong-with-mmm/</a> and Chubby<br>

in Google - <a href="http://www.read.seas.harvard.edu/~kohler/class/08w-dsi/chandra07paxos.pdf" target="_blank">http://www.read.seas.harvard.edu/~kohler/class/08w-dsi/chandra07paxos.pdf</a>)<br>

where people were trying to write (MMM) or use (Chubby) some kind of<br>

distributed code and failed.<br>

That&#39;s why whenever I see anything related to distributed state I say<br>

that it&#39;s way beyond my understanding and that it is best to use something<br>

that works.<br>

<div></div></blockquote></div><br><div><br></div><div>You were fortunate to have that resource available, as well as (I assume) a ready-made resource agent for managing FreeSWITCH. I had to learn it from this:</div>


<div><br></div><div><a href="http://clusterlabs.org/doc/en-US/Pacemaker/1.0/pdf/Pacemaker_Explained/Pacemaker_Explained.pdf" target="_blank">http://clusterlabs.org/doc/en-US/Pacemaker/1.0/pdf/Pacemaker_Explained/Pacemaker_Explained.pdf</a></div>


<div><br></div><div>I also had to craft a resource agent to manage FreeSWITCH (none existed at the time). Then I found out Pacemaker was buggy (it has gotten much better since I started using it) and wouldn&#39;t honor colocation constraints or grouping correctly in certain failure conditions, so I had to make the resource agent handle managing all the IP addresses for FreeSWITCH (each instance had 12 Sofia profiles with each one running on a different IP). I spent months testing hundreds of different possible failure conditions and fixing dozens if not hundreds of bugs in the configuration and in how the resource agent managed everything and reported on the health of FreeSWITCH. That covered everything from someone accidentally removing a needed IP from the system, to a failed hard drive, to a Sofia profile failing to load, to firewall rules accidentally blocking needed ports. If you spent only one day setting up such a system, I am certain you failed to account for dozens if not hundreds of possible failure conditions. At the end of those 3 months of hell, I had a single pair of nodes which I could rely on to &quot;do the right thing&quot; under practically any failure condition. However, even then, I still had several dozen ways I could simulate FreeSWITCH failing which the system simply could not detect efficiently. I made attempts at testing some of them, but running those tests frequently enough to matter put so much load on the system that it fell outside the specifications I needed for the project to be profitable and workable.</div>


<div><br></div><div>I have years of experience building and deploying FreeSWITCH clusters with Pacemaker and Corosync and hunting down and gracefully handling practically every conceivable way such a system could fail. I understand you think it&#39;s hard to do, and that is not without reason. I&#39;ve lived it; I&#39;ve done it. I know what&#39;s involved in the process. I simply want to take my experience with it and write it down in code in the form of mod_ha_cluster so that other people don&#39;t have to waste their time relearning all the things I already know with regard to making FreeSWITCH run in an HA setup. In the absence of Pacemaker and Corosync, my goal is to give mod_ha_cluster enough awareness that the vast majority of failure cases are handled gracefully and FS can take care of bringing a slave online to take over for a failed node by itself. However, there is no reason I cannot also write it to let Pacemaker and Corosync give it direction as to which slave to turn into a master. So, if it makes you more comfortable, think of it as a glorified resource agent which always happens to know about the &quot;deep state&quot; of the nodes and can test for things that traditional resource agents can never test effectively. Then, when you do a &quot;shallow&quot; poll of the state, you get back the &quot;deep state&quot; instead, at only the cost of doing a &quot;shallow&quot; test. And, on top of all that, it will handle synchronizing various data between the nodes so you don&#39;t need to rely on an external HA database.</div>
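<div><br></div><div>As a rough illustration of the &quot;shallow poll, deep state&quot; idea, here is a minimal Python sketch. It is not mod_ha_cluster code: the class name, the check names, and the staleness timeout are all hypothetical, chosen only to show expensive checks running on their own schedule while the poll reads a cached result.</div>
<pre>
```python
import time


class DeepStateCache:
    """Hypothetical helper: deep checks run separately, polls read the cache."""

    def __init__(self, checks, max_age=5.0, clock=time.monotonic):
        self.checks = checks      # dict: check name to zero-argument callable returning bool
        self.max_age = max_age    # assumed timeout (seconds) before a result counts as stale
        self.clock = clock        # injectable so the sketch can be tested without sleeping
        self.failed = []
        self.updated_at = None

    def run_deep_checks(self):
        """Expensive path: run every check and remember which ones failed."""
        self.failed = sorted(name for name, check in self.checks.items() if not check())
        self.updated_at = self.clock()

    def shallow_poll(self):
        """Cheap path: answer from the cache without running any check."""
        if self.updated_at is None:
            return ("unknown", [])
        if self.clock() - self.updated_at > self.max_age:
            return ("stale", list(self.failed))
        if self.failed:
            return ("failed", list(self.failed))
        return ("ok", [])


if __name__ == "__main__":
    # Made-up checks standing in for things like "is the needed IP still present".
    cache = DeepStateCache({
        "sofia_profile_loaded": lambda: True,
        "needed_ip_present": lambda: False,
    })
    cache.run_deep_checks()
    print(cache.shallow_poll())
```
</pre>
<div>In this sketch a resource agent&#39;s monitor action would only ever call shallow_poll(), so the cost of polling stays constant no matter how many deep checks exist. Treating an old result as &quot;stale&quot; rather than &quot;ok&quot; is a design choice made here so that a hung checker is not mistaken for a healthy node.</div>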


<div><br></div>