On Sun, Feb 10, 2013 at 4:32 PM, Steven Ayre <span dir="ltr">&lt;<a href="mailto:steveayre@gmail.com" target="_blank">steveayre@gmail.com</a>&gt;</span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="im"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">One is that mod_ha_cluster is an N+1 cluster, not an active/passive pair. That way if any single node fails, there&#39;s something to pick up the slack.</blockquote>


<div><br></div></div><div>Corosync clusters are not limited to active/passive pairs. It&#39;s just a very common setup.</div><div><br></div>For example you could have resource agents 1) to keep FS running on all nodes 2) for virtual IPs 3) for IP:port Sofia profiles. You can then define dependencies between them. That should let you keep FS running at all times and move an IP and the associated Sofia profiles to a new node that&#39;s already running FS when the original node fails. For maintenance you can simply trigger that from the CRM.</blockquote>

<div><br></div><div>This is true, however it would require a very complicated resource agent to manage FreeSWITCH in a similar configuration to what mod_ha_cluster is designed to do. In addition, the functionality simply does not exist in FreeSWITCH right now to tell it to take over for a specific failed master and recover those specific calls. So, right now, using Pacemaker and Corosync, there is absolutely no way to run an N + x FreeSWITCH cluster. Also, the response time on a Pacemaker + Corosync cluster for failure detection and recovery is measured in seconds, which is not ideal for a real-time communications platform. Obviously, there is nothing at all preventing you from running Pacemaker and Corosync in addition to mod_ha_cluster. In fact, I was even considering providing some CLI arguments to allow FreeSWITCH (with mod_ha_cluster enabled) to be commanded from Pacemaker and act as its own resource agent. If you think that would be an interesting feature, I can look into what it would take to work that out. I previously wrote a resource agent for Broadvox when I worked for them, and I wrote another one after leaving them. Both were intended to manage FreeSWITCH as a master/slave pair, so I have some idea of how to do it, just not necessarily the specifics of making it do N + x the way I intend to have mod_ha_cluster operate.</div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im"><div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

Secondly, in order to recover live calls, you need a list of the calls. That currently requires some sort of odbc (or postgres) with replication. Again, that&#39;s abstracted as part of mod_ha_cluster.<br>


Third: The docs mention a similar kind of pooling for registration, so that you can register to one server and you&#39;re registered on them all without needing a DB to sync everything.</blockquote></div><div><br></div></div><div>Which can also be done using Corosync&#39;s IPC messaging API.</div>


<div><br></div><div>(Personally I prefer using MySQL Cluster via ODBC - which is HA, synchronous and takes all of the load off the FS nodes, but that&#39;s off topic).</div><div class="im"><div><br></div></div></blockquote>

<div><br></div><div>I have used MySQL Cluster and Postgres with my own replication daemon. Neither is an ideal solution, for a multitude of reasons. Using something like Corosync&#39;s IPC messaging is not ideal either. The FreeSWITCH core already has a sufficiently robust API for doing network messaging, and also its own event system which happens to be perfectly suited for exactly the kind of messaging I need to accomplish. Using the Corosync IPC messaging API would be like trying to shove a large round peg through a small square hole. </div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im"><div></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


Fourth, according to the docs: single configuration for all FS instances, rather than manually ensuring each one has the same config.</blockquote><div><br></div></div><div>This could be achieved with the Corosync API.</div>

<div class="im"><div>


<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">Fifth: Voicemail clustering? Or we&#39;ll have to wait for mod_voicemail&#39;s APIs to be rewritten for that, perhaps...</blockquote>


<div><br></div></div>That&#39;s not going to happen without a storage API added to the FS core - you&#39;re always going to need some 3rd party such as NFS, CIFS, DRBD etc. mod_voicemail is hardcoded to use the ODBC interface but that&#39;s only for the index, not the recordings.<div>


<br></div><div>Corosync would allow you to make FS depend on the NFS/whatever service running so if the storage backend has failed FS would move to a node where it is available - not necessarily possible from within mod_ha_cluster itself.</div>

</blockquote><div><br></div><div>Well, I *could* broadcast an entire voicemail to all nodes, but that seems like a bit of overkill, and I don&#39;t see any real reason to code a storage system into FreeSWITCH. That really would be reinventing the wheel. I don&#39;t think the majority of users would find it all that difficult to set up a shared NFS space somewhere, or even use something like MooseFS. In fact, MooseFS is so easy to use, I went from reading about it for the first time to having a fully functional 6-node shared storage cluster in about 45 minutes. This is one place where a 3rd party solution is really the optimal approach. </div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im"><div></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

There&#39;s certainly <i>something </i>special possible with mod_ha_cluster that can&#39;t be done with existing solutions cleanly, if at all...</blockquote>


<div><br></div></div><div>Don&#39;t misunderstand me, I think it&#39;s a great idea to have a module aimed at HA and sharing state across a cluster while being able to detect new failure conditions.</div>

<div><br></div><div>I just think that in the specific area of node monitoring and messaging across the cluster it would be better to use a well tested and proven solution such as Corosync which is based on a large number of papers, algorithms, and generally decades of work. Every time I&#39;ve seen/heard of an attempt to redo that from scratch it&#39;s been unreliable, especially in unexpected failure conditions. The fact that it&#39;s a dependency on another program isn&#39;t a good enough reason not to use it - you&#39;re already depending on various other programs (Linux, sysvinit, monit, cron, syslog etc) anyway. By not using it you&#39;re just adding extra work, adding unnecessary complexity and increasing the risk of bugs. Corosync would also have advantages because the tools to migrate services to another node for maintenance, detect and restart resources on failure etc already exist.</div>

</blockquote><div><br></div><div>I think you greatly overestimate the complexity of the messaging and node monitoring required to make FS run as a multi-master, multi-slave cluster, detect a node failure, and have it automatically recover calls on a slave system. You seem to be thinking of the task as if it were a general purpose HA solution which needs to work for any random piece of software across every imaginable network configuration and software deployment scenario. </div>

<div><br></div><div>The messaging I need to do is simple:</div><div><br></div><div>1) I need to send events from the local FS node to all other slave nodes in the cluster so they can synchronize call state, registration information, call limiting information, etc.</div>

<div>2) I need to send heartbeats from all nodes to all nodes out all configured NICs so they can keep track of who is in which state and which is the designated slave in case of failure.</div><div><br></div><div>These two tasks accomplish everything needed to make this system work. The heartbeats let each node calculate the state of the cluster. There is no &quot;single brain&quot; in the cluster. There is a very specific and well-defined set of rules by which the state of the cluster is calculated by each node, and all nodes will always arrive at the same conclusion so long as you have multiple physical networks to ensure communication is never broken between sets of nodes. </div>
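<div><br></div><div>To make the heartbeat side of that concrete, here is a minimal Python sketch of the bookkeeping each node could do. It is only an illustration under assumptions: the names <code>ClusterView</code> and <code>PeerView</code>, the 0.5 second timeout, and the NIC labels are all made up for this example and are not taken from mod_ha_cluster. The one idea it shows is that a peer only counts as down once its heartbeats have gone quiet on every network, which is why multiple physical networks matter.</div>
<pre>
```python
from dataclasses import dataclass, field

# Hypothetical value: how long a peer may stay silent before it counts as down.
HEARTBEAT_TIMEOUT = 0.5  # seconds


@dataclass
class PeerView:
    """What the local node knows about one peer, built only from heartbeats."""
    role: str                                      # "master" or "slave", as announced
    last_seen: dict = field(default_factory=dict)  # NIC name -> time of last heartbeat


class ClusterView:
    """Per-node picture of the cluster; every node keeps its own copy."""

    def __init__(self):
        self.peers = {}  # node name -> PeerView

    def on_heartbeat(self, node, role, nic, now):
        """Record a heartbeat from `node` that arrived on `nic` at time `now`."""
        peer = self.peers.setdefault(node, PeerView(role))
        peer.role = role
        peer.last_seen[nic] = now

    def is_alive(self, node, now):
        """A peer is alive if at least one NIC heard from it recently enough."""
        peer = self.peers[node]
        return any(HEARTBEAT_TIMEOUT >= now - seen
                   for seen in peer.last_seen.values())

    def failed_masters(self, now):
        """Masters that have gone silent on every NIC, in a stable (sorted) order."""
        return sorted(name for name, peer in self.peers.items()
                      if peer.role == "master" and not self.is_alive(name, now))
```
</pre>
<div>In this sketch, a master whose heartbeats stopped on one NIC but are still arriving on another is not reported by <code>failed_masters</code>. Only a master that is silent on all NICs is treated as failed.</div>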

<div><br></div><div>The decision making happens on the slave nodes only. The masters are already in their role and will stay that way unless something catastrophic occurs.  They have ways to detect internal failures, and I will have a way set up for them to detect and deal with a segfault of FS, as well. If such a failure occurs, their single purpose task is to shut down as cleanly as possible and get themselves back into a stable state. The slaves also have only one goal: to recover the calls of the failed master and become the new master. One slave is always designated as the primary candidate, and all slaves know which it is at all times (once they exchange their first heartbeats). They also know which is the secondary, tertiary, etc. It is all pre-determined up-front and only changes if a failure, reconfiguration, or maintenance event occurs. When the primary sees a master go down, it will perform a sanity check on itself first (to make sure it didn&#39;t have an issue) and then take over for the first master it saw go down. Once it has started that process, it broadcasts that it is switching to master and which master it is taking over for. At this point, the secondary immediately becomes primary to all remaining slaves and it is immediately available to take over for any other master nodes which might also have failed simultaneously. </div>
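<div><br></div><div>The succession among slaves can be sketched in a few lines of Python as well. Again, this is a hedged illustration and not the actual rule set: the class <code>SlavePool</code>, the numeric priorities, and the tie-break on node name are assumptions I am using only to show how every slave can compute the same primary, secondary, tertiary order locally and how the order shifts when a takeover is broadcast.</div>
<pre>
```python
class SlavePool:
    """Local, deterministic view of which slave is next in line.

    `priorities` maps slave name -> configured priority (hypothetical config;
    a lower number means earlier in line). Ties are broken by node name so that
    every node sorting the same data arrives at the same order.
    """

    def __init__(self, priorities):
        self.priorities = dict(priorities)

    def order(self):
        """Succession order: primary first, then secondary, tertiary, ..."""
        return sorted(self.priorities,
                      key=lambda name: (self.priorities[name], name))

    def primary(self):
        """The slave that takes over for the next failed master, if any is left."""
        order = self.order()
        return order[0] if order else None

    def on_takeover_broadcast(self, slave, master):
        """Handle 'slave is switching to master, taking over for master'.

        The announcing slave leaves the pool, so the former secondary is
        immediately the primary for everyone who received the broadcast.
        """
        self.priorities.pop(slave, None)
        return (slave, master)
```
</pre>
<div>For example, with slaves <code>fs-b</code> (priority 1), <code>fs-c</code> (priority 2) and <code>fs-d</code> (priority 2), the primary is <code>fs-b</code>. Once <code>fs-b</code> broadcasts that it is taking over for a failed master, <code>fs-c</code> becomes the primary and is available for any other master that failed at the same time.</div>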

<div><br></div><div>The whole process is very specific and tightly orchestrated. It is barely comparable in complexity to the failure detection and failover process which Pacemaker has to deal with. I am not trying to recreate something like Pacemaker. It is a hugely complex system which has to deal with all sorts of random and generalized configurations of endless types of software. Pacemaker is like an interstellar spacecraft while I just need a rocket to put a satellite in orbit.  </div>

</div>