May 10, 2013
Hello. Today I am writing a short article on the huge subject of High Availability (HA) and Disaster Recovery (DR), and why you should consider them to provide highly available IT services and minimise downtime for your users.
I’ll cover brief definitions and examples of HA and DR planning, and explain their differences. I feel this is important because many people use HA and DR interchangeably, and sometimes in the wrong context.
I talk to many clients and help to architect highly available systems, mostly based on Microsoft technologies such as Windows, SharePoint and SQL Server, and many common questions about HA and DR come up in those discussions. I thought it would be good to get this down in a high-level article to help others, and for me to reference when discussing these topics.
What is High Availability (HA)?
In the world of IT infrastructure, a highly available service is one that is resilient to component failure at the site level, where a site is a data centre. Resilience here means that the service would have little or no downtime if any one component in the infrastructure hardware or software stack failed. At the hardware level, a component could be anything from a power supply in a server to a motherboard, a hard disk or a network switch. This then bubbles up to the software level: the operating system, then the user application and any other software that may be present, such as drivers or backup software.
A service can be made resilient in a number of ways, at both the hardware and the software level.
Here are some examples:
Component Level Resiliency
The above table gives a rough idea of how we can architect some level of high availability. The application row is very high level, as application resiliency depends on how the application is written and, for example, whether it is cluster aware; some applications cannot be added to a cluster effectively.
An example for SharePoint would be to use two or more servers in what is termed a “farm” and use network load balancing (NLB) to distribute network traffic, for example to the most responsive server. For Microsoft SQL Server we could create a cluster, as SQL Server is cluster aware. Well, the Database Engine, Analysis Services and SSIS (to a degree) are. For Reporting Services we would use the farm scenario with NLB, as in our SharePoint example.
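To make the farm idea a little more concrete, here is a minimal Python sketch of a balancer that skips unhealthy servers and sends traffic to the most responsive one. It only illustrates the principle; it is not how Windows NLB or any particular load balancer decides where to send traffic, and the `Server` class, the `pick_server` function and the server names are all made up for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Server:
    name: str
    healthy: bool        # result of the last health probe (assumed to exist)
    response_ms: float   # last measured response time


def pick_server(farm: List[Server]) -> Optional[Server]:
    """Return the healthy server with the lowest response time, or None."""
    candidates = [s for s in farm if s.healthy]
    if not candidates:
        return None  # the whole farm is down: nothing to route to
    return min(candidates, key=lambda s: s.response_ms)


# Hypothetical three-server farm; wfe02 has failed its health probe.
farm = [
    Server("wfe01", healthy=True, response_ms=42.0),
    Server("wfe02", healthy=False, response_ms=5.0),
    Server("wfe03", healthy=True, response_ms=18.0),
]

chosen = pick_server(farm)
print(chosen.name if chosen else "no healthy servers")
```

The point to take away is that a failed farm member is simply left out of the rotation, so users keep being served by the remaining servers.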
So now we have a highly available service in our data centre, running on highly available hardware with multiple servers fed from multiple power sources and internet links. In effect we are protected against any single component failure, and we can automatically fail over to another physical host, virtual server, power supply or internet feed, for example. We may want to control this, so failover could be manual for some or all components. This is where monitoring comes in, but that’s a subject for another day!
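A rough way to see why doubling up components helps is to do the arithmetic. The sketch below assumes components fail independently of each other, which is a simplification, and the availability figures are invented for illustration rather than taken from any real hardware. Components that must all work are multiplied together; a redundant pair is only down when both copies are down.

```python
from math import prod


def series(*availabilities: float) -> float:
    """Every component must be up (no redundancy): multiply availabilities."""
    return prod(availabilities)


def redundant(availability: float, copies: int) -> float:
    """Up if at least one of `copies` identical, independent components is up."""
    return 1 - (1 - availability) ** copies


# Illustrative figures only, not measurements.
psu, server, switch = 0.99, 0.99, 0.999

single = series(psu, server, switch)
resilient = series(redundant(psu, 2), redundant(server, 2), redundant(switch, 2))

print(f"single path:    {single:.4%}")
print(f"redundant path: {resilient:.4%}")
```

Under these assumptions the redundant stack comes out noticeably more available than the single path, because each pair only fails when both of its members fail at the same time.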
A Highly Available Virtual Host Cluster
To give our virtual machines (VMs) some high availability and protection from physical hardware failure, and potentially from power and network link outages, we could create a cluster. The principle of a virtual host cluster is the same in Hyper-V as in VMware: the hosts can support multiple VMs, which can automatically move (fail over) to another host if required.
Here we see a highly available three node Hyper-V cluster which can host a number of virtual machines.
The blue machines represent general VMs and the red one represents a mission critical virtual machine that must be kept alive at all times.
The cluster is designed to be highly available in that there are multiple hosts with multiple connections to the various networks required, such as heartbeat, Live Migration, iSCSI SAN for storage and the LAN. Any network switch, network card or port can fail and the system will continue to run. A host could also fail, and any VMs marked for high availability would be automatically restarted on another host server with minimal downtime (host capacity allowing). Assume the iSCSI SAN storage is provided by a highly available storage system with RAID disk configurations to help protect against disk failure.
So we are protected against the failure of any single physical server, network or storage component.
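The “host capacity allowing” caveat is worth a small illustration. The Python sketch below is a simplified model, not how Hyper-V or VMware actually place VMs: it assumes memory is the only constraint, and the `VM` class, the `fail_over` function, the host names and the sizes are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class VM:
    name: str
    memory_gb: int
    highly_available: bool
    host: str


def fail_over(free_gb: Dict[str, int], vms: List[VM], failed_host: str) -> List[str]:
    """Re-place HA VMs from a failed host; return the names of VMs left offline.

    `free_gb` maps each host to its free memory before the failure.
    """
    survivors = {h: free for h, free in free_gb.items() if h != failed_host}
    offline = []
    for vm in vms:
        if vm.host != failed_host:
            continue
        # Try the surviving host with the most free memory.
        target = max(survivors, key=survivors.get, default=None)
        if vm.highly_available and target is not None and survivors[target] >= vm.memory_gb:
            survivors[target] -= vm.memory_gb
            vm.host = target
        else:
            offline.append(vm.name)  # not marked HA, or no capacity left
    return offline


# Hypothetical three-node cluster; hv01 is the host that fails.
free_gb = {"hv01": 8, "hv02": 16, "hv03": 4}
vms = [
    VM("critical-sql", 12, True, "hv01"),
    VM("web01", 8, True, "hv01"),
    VM("test01", 4, False, "hv01"),
    VM("web02", 4, True, "hv02"),
]

offline = fail_over(free_gb, vms, "hv01")
print("left offline:", offline)
```

In this made-up example the critical VM finds a home, but a second HA VM does not because the surviving hosts have run out of memory. That is why a cluster needs enough spare capacity to absorb the loss of a host.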
So why do I need to use clustering or NLB when my environment and VMs are already highly available?
The answer to this common question is that there are several good reasons to protect your service beyond the physical level described above. The short answer comes if you think of a blue screen, an OS freeze or maintenance.
These reasons are listed here with a bit more detail:-
Application Level High Availability through NLB or Clustering
So are we done? Our application or service is now highly available and nothing can possibly take it down… or can it?
Well, if you have a highly secure, solid and resilient data centre where power, cooling, access and so on are all good, then almost all is well and the chances of service outages are reduced significantly. However, as we live in the real world, things CAN go wrong at the data centre level. This leads us into disaster recovery.
Disaster recovery is required to protect services from a disaster at the data centre level. The process and planning of bringing services back online in the case of a disaster is referred to as Business Continuity. That is, if your business depends on certain services, will these services be available in the event of a disaster so that the business can continue to function? Other factors to consider in business continuity include how users will be notified and, if the main offices are no longer accessible, where people will work from. The most likely events that would require invocation of a disaster recovery plan are disasters such as floods and fires, a malicious attack, or even a UFO, meteorite or plane crashing into the data centre. Some of these things are more likely to happen than others!
In the event of a disaster, services would resume from the DR site, ideally with little or no disruption to the users of the services. There may well be manual system administration tasks to complete in a DR scenario, such as IP address configuration and SAN storage and cluster configuration, to bring up services. As technology evolves these manual steps get simpler, and both the effort required and the recovery time are reduced. An example of this is Microsoft’s Hyper-V Replica, which is designed for a DR scenario. Hyper-V Replica allows the administrator to configure a separate IP address for the DR site, which would previously have needed additional scripts or manual steps.
Other reasons for having DR are to gain the confidence of your clients and to prove to auditing bodies that your organisation can continue in the event of a disaster. The benefit of having a GOOD DR process and architecture is that DR can be tested properly with minimal or no disruption (out of hours please!) to consumers of a service.
Having a good HA and DR process in place for your mission critical systems gives your business the confidence to perform the value adding activities required for the organisation to be successful. The outcome of this may be better customer satisfaction, more profit or even saving lives in some cases. HA and especially DR require proper planning and investment. Some example questions to ask when considering an HA and DR strategy are “Why do we need it?”, “What does it cost us if there is an outage?”, “Whose responsibility is it if service x is down for y amount of time?”, “How long can we function without the service?”.
Most importantly, I think it is essential to ask “What would happen if we did nothing?”. For example, would you lose more money if the service was down for 1 day than it would cost to implement a good HA and DR strategy? Would you lose your trading license or lose customers?
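One simple way to approach the “do nothing” question is to put rough numbers against it. The sketch below is a back-of-the-envelope comparison in Python; every figure in it is a made-up placeholder, and it deliberately ignores things that are hard to price, such as reputation, licences and lost customers.

```python
def expected_annual_outage_cost(cost_per_day: float, outage_days_per_year: float) -> float:
    """Expected yearly loss from downtime, given a daily cost and expected outage days."""
    return cost_per_day * outage_days_per_year


# All figures below are invented placeholders; substitute your own estimates.
cost_per_day = 50_000.0          # revenue or productivity lost per day of outage
outage_days_without = 2.0        # expected outage days per year with no HA/DR
outage_days_with = 0.1           # expected outage days per year with HA/DR
annual_cost_of_ha_dr = 60_000.0  # hardware, licences, DR site, testing

do_nothing = expected_annual_outage_cost(cost_per_day, outage_days_without)
with_ha_dr = expected_annual_outage_cost(cost_per_day, outage_days_with) + annual_cost_of_ha_dr

print(f"do nothing: {do_nothing:,.0f} per year")
print(f"with HA/DR: {with_ha_dr:,.0f} per year")
print("HA/DR pays for itself" if with_ha_dr < do_nothing else "HA/DR costs more than it saves")
```

The numbers themselves matter less than the exercise: writing down the cost of an outage next to the cost of preventing it makes the conversation with the business much easier.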
Well, it is a very broad topic and I hope I have highlighted some key or interesting points for you to take away. Please do comment on the post if you found this useful, or if there are other important points which could help others. And thanks for reading. I know it’s a long article this time…