Downtime is rarely an option for mission-critical applications. While many strides have been made over the last decade to ensure uninterrupted application availability, some gaps in protection remain. One of these is the time it takes to detect that a failure has occurred so a recovery on a secondary server can be promptly initiated and successfully completed. Expediting these server recoveries is what the new Intelligent Monitoring Framework, introduced in Storage Foundation HA 5.1 Service Pack 1 (SP1), accomplishes.
Many organizations are more than happy with the speed of failover on the clustering solution they use to support mission-critical applications, as these solutions enable failovers to occur in mere minutes. But for some of these applications, even minutes of downtime can prove unacceptable. Yet closing this failover gap from over a minute to a minute or less has proven to be a challenge for two reasons.
First, clustering solutions poll the services running on a server’s operating system to detect whether one of them has failed. However, depending on which service has failed and when it fails, it could take up to a minute before the failure is even detected, and a failover of the server cannot be initiated until then. The time it takes to detect the service failure is a major contributor to the minute or more of application downtime.
Second, the polling could theoretically be sped up so that it checks for failed services on the operating system more frequently, but this increases overhead on the server. That in turn could slow application performance, which is also unacceptable to end users. To date, these trade-offs have left users at a stalemate.
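The trade-off between detection delay and polling overhead can be put into rough numbers. The sketch below is only an illustration: the function names and the figure of 100 monitored services are hypothetical, it assumes every service is checked once per interval, and it ignores the time the checks themselves take. Only the 60-second figure comes from the discussion above.

```python
import math


def polling_tradeoff(interval_s: float, monitored_services: int) -> dict:
    """Summarize detection delay and check volume for a fixed polling interval."""
    return {
        # A failure that happens just after a poll waits a full interval to be seen.
        "worst_case_delay_s": interval_s,
        # Assumes failures are equally likely at any point within the interval.
        "average_delay_s": interval_s / 2,
        # Assumes every monitored service is checked once per interval.
        "checks_per_hour": monitored_services * 3600 / interval_s,
    }


def detection_time(failure_at_s: float, interval_s: float) -> float:
    """Return the time of the first poll at or after the failure (polls at 0, interval, ...)."""
    return math.ceil(failure_at_s / interval_s) * interval_s


if __name__ == "__main__":
    for interval in (60, 5):
        print(interval, polling_tradeoff(interval, monitored_services=100))
    print(detection_time(failure_at_s=61, interval_s=60))
```

Under these assumptions, shortening the interval from 60 seconds to 5 seconds cuts the worst-case detection delay by a factor of twelve but raises the number of checks from 6,000 to 72,000 per hour, which is the overhead side of the stalemate.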
This brings us to the introduction of the Intelligent Monitoring Framework (IMF) in Storage Foundation HA 5.1 SP1 for the Veritas Cluster Server (VCS) Process, Mount and Oracle agents. IMF extends the existing VCS agent framework so that, as of SP1, agents receive state change notifications from the operating system within seconds and without increasing the overhead on the application server.
To accomplish this, the VCS agent no longer polls the operating system, as it did in the past, looking for signs that a process has died or is in a “hung” state. Rather, the SP1 VCS agent takes a more passive role and interfaces directly with the operating system kernel on each application server through APIs that the kernel provides.
Now when a process on the application server dies or “hangs,” the operating system generates an alert that is exposed through its API and is captured automatically and nearly instantaneously by the VCS agent. Because the server’s operating system proactively notifies the VCS agent, the agent no longer needs to poll continually for these changes, and the notifications reach it sooner.
The prior technique of polling required the VCS agent to check each process on each server’s operating system for failures in a round-robin pass that could take 60 seconds or longer. With the new technique of monitoring notifications, the VCS agent is notified as soon as an event occurs on the OS, and a failover to the secondary server can then be initiated.
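The difference between the two monitoring models can be seen in a small, general-purpose analogy. This is not IMF or VCS code, and it does not use the kernel interfaces IMF relies on, which are not detailed here. It is a minimal Python sketch, using only the standard library, in which one watcher polls a child process on a fixed interval while another blocks until the operating system reports that its child has exited. All function names, the 0.2-second child lifetime, and the 1-second polling interval are assumptions chosen for illustration.

```python
import subprocess
import sys
import threading
import time


def start_child(lifetime_s: float) -> subprocess.Popen:
    """Start a stand-in 'service' that exits on its own after lifetime_s seconds."""
    code = f"import time; time.sleep({lifetime_s})"
    return subprocess.Popen([sys.executable, "-c", code])


def watch_by_polling(proc, interval_s, report):
    """Older model: look at the process state once per interval."""
    while proc.poll() is None:
        time.sleep(interval_s)
    report("polling")


def watch_by_notification(proc, report):
    """Notification model: block until the OS reports that the process exited."""
    proc.wait()
    report("notification")


def compare(lifetime_s: float = 0.2, interval_s: float = 1.0) -> dict:
    """Return seconds from start until each watcher noticed its child's exit."""
    noticed = {}
    lock = threading.Lock()
    start = time.monotonic()

    def report(name):
        with lock:
            noticed[name] = time.monotonic() - start

    # One child per watcher, so the two watchers never share a process handle.
    polled, notified = start_child(lifetime_s), start_child(lifetime_s)
    watchers = [
        threading.Thread(target=watch_by_polling, args=(polled, interval_s, report)),
        threading.Thread(target=watch_by_notification, args=(notified, report)),
    ]
    for watcher in watchers:
        watcher.start()
    for watcher in watchers:
        watcher.join()
    return noticed


if __name__ == "__main__":
    print(compare())
```

With these settings the notification watcher should report shortly after its child exits, while the polling watcher does not notice until its next scheduled check. The blocked watcher also consumes essentially no CPU while it waits, which mirrors the lower overhead claimed for notification-based monitoring.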
IMF’s technique of capturing alerts generated by the operating system also comes into play once the failover to the secondary physical server has started. If a specific process should “hang” on the secondary server during a failover, IMF is again alerted as to which process or processes on that secondary server are causing the failover to “hang.” Alerts can then be generated and corrective action taken so the failover to the secondary server can be completed.
In the last few years, the need for mission-critical applications to maintain a constant state of availability has grown to the point that even a minute of downtime is too long for some applications. IMF in Storage Foundation HA 5.1 SP1 addresses that concern with its proactive detection of process failures while also reducing the overhead associated with monitoring for these failures. In so doing, Storage Foundation HA 5.1 SP1 gives organizations the faster recoveries and lower server overhead they seek simply by upgrading to the latest version of a product that they already know and trust.