In viable information systems, data must be available when users require it. This means that every system and subsystem tasked with serving data to the user, including any security controls, must function correctly together. A “high availability” system is one expected to serve its users at all times except during scheduled downtimes, which are usually short and published in advance so users are aware of them. In these systems, preventing unscheduled service disruptions is paramount, so the following types of events are taken into account in system design: power outages, hardware failures, and denial-of-service attacks. If the availability requirements do not allow downtime long enough for maintenance, then the system design must also maintain availability during events like system upgrades, backups, and patching.
For many mission-critical applications, availability may be perceived as the most important quality. Nevertheless, highly available systems are usually also expected to protect confidentiality and to provide ample controls for data integrity. In general, as the requirements for availability increase, the controls necessary to maintain confidentiality and integrity increase proportionally.
Reliability and Availability
The more reliable critical system components and subsystems are, the more available the overall system is. Systems and their components will not be reliable enough unless reliability engineering is included in the design phase. Below is a basic design approach for increasing system reliability so that a system can achieve high availability.
- Measure availability.
- Log system events.
- Eliminate single points of failure.
- Routinely test crossover to redundant systems.
Measure availability

Employ a separate system to measure availability. This may be one your organization designs for use on its own network; if so, take great care to ensure it is not susceptible to the same events your target system is vulnerable to. Alternatively, there are many relatively inexpensive internet monitoring services that are easily configured to measure any internet system’s uptime continually at a set interval (e.g., every five minutes). They not only detect and log availability and downtime, but also collect all kinds of metadata that can prove immensely valuable when troubleshooting an availability problem (e.g., connection time, time to first byte, time for entire message delivery). Because these services run by default on different computer and network systems, often in different cities or countries, they provide the physical and logical independence necessary to monitor your system’s uptime.
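As a minimal sketch of one such probe (the target URL and check interval are assumptions, and a real monitoring service would run this from several independent locations), the following measures whether a system is up and records the kinds of timing metadata mentioned above:

```python
import time
import urllib.request
import urllib.error

def probe(url, timeout=10):
    """One availability check. Returns a record of timing metadata;
    a failed or slow request is recorded as 'up': False."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            # urlopen returns once response headers arrive, so this
            # approximates time to first byte.
            first_byte = time.monotonic()
            body = resp.read()
            done = time.monotonic()
            return {
                "up": 200 <= resp.status < 300,
                "ttfb_s": first_byte - start,
                "total_s": done - start,
                "bytes": len(body),
            }
    except (urllib.error.URLError, OSError):
        return {"up": False, "total_s": time.monotonic() - start}

def availability(samples):
    """Percentage of successful probes over a window of samples."""
    if not samples:
        return 0.0
    return 100.0 * sum(1 for s in samples if s["up"]) / len(samples)

# A scheduler would call probe("https://example.com/health") every
# five minutes ("/health" is a hypothetical endpoint) and append the
# result to a window of samples for the availability calculation.
```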
Log system events
You will also want to employ an internal subsystem that logs all critical events. This captures important data local to your system that an external system cannot access. You will certainly want to log any failure or error events, but any events involving security, confidentiality, integrity, elapsed process time, and the like will also prove valuable when troubleshooting an availability problem.
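As a minimal sketch of such internal event logging, the snippet below uses Python’s standard logging module; the subsystem name and the latency threshold are illustrative assumptions, not prescriptions:

```python
import logging

# Minimal internal event log: each record carries a timestamp, severity,
# and subsystem name so an availability incident can be correlated with
# local system state afterward.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
log = logging.getLogger("orders-db")  # hypothetical subsystem name

def classify(elapsed_s, threshold_s=2.0):
    """Map elapsed process time to a log severity (threshold is an assumption)."""
    return "warning" if elapsed_s > threshold_s else "info"

def record_query(elapsed_s, threshold_s=2.0):
    """Log elapsed query time, escalating severity past the threshold."""
    getattr(log, classify(elapsed_s, threshold_s))("query took %.2fs", elapsed_s)

record_query(0.4)  # routine event, logged at INFO
record_query(3.1)  # escalated to WARNING: slow queries often precede downtime
```

Logging elapsed-time events at an escalated severity, as here, is what later lets you find thresholds the system crossed before a failure.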
Eliminate single points of failure
Review your system to identify single points of failure, and add redundant components or subsystems so the system can compensate when a component or subsystem fails. This requires a well-documented system design. Review your hardware, software, and service providers. Also review your people to ensure no one person is a single point of failure. For instance, perhaps only one person knows how critical parts of the system function; if that person is unavailable when something compromises that area, the availability of the system will suffer. Further, your people already know where many of these single points of failure are. Scheduling brainstorming sessions where you collect and rank known single points of failure will invariably prove helpful and uncover items you hadn’t accounted for. Where possible and affordable, add redundancy across all these areas and you increase the capability for high availability.
Routinely test crossover to redundant systems
When you design redundancy to eliminate the single point of failure, the resulting crossover to the redundant system or component becomes the single point of failure. Reliable systems must provide for reliable crossover. Testing the crossover to a redundant component or system on a routine schedule is important. Otherwise, you risk all your investment in the redundant systems and components for what usually is something comically simple: a fuse trips because the extra system draws more current than anticipated when coming online.
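The crossover decision itself can be reduced to a small, testable rule, which is what a routine drill exercises. Below is a sketch under assumed names (the endpoint labels and the stubbed health check are hypothetical); the point is that the drill simulates a primary failure and verifies the standby actually takes over:

```python
def choose_endpoint(primary_up, standby_up):
    """Crossover rule: prefer the primary; fail over only to a healthy standby."""
    if primary_up:
        return "primary"
    if standby_up:
        return "standby"
    return None  # total outage: both endpoints are down

def crossover_drill(check_standby):
    """Routine drill: pretend the primary has failed and confirm the
    standby would be selected. check_standby is a callable returning
    a bool; run this on a schedule, not just once at deployment."""
    return choose_endpoint(False, check_standby()) == "standby"

# Example drill with a stubbed health check standing in for a real probe:
print("crossover OK" if crossover_drill(lambda: True) else "crossover FAILED")
```

In production the stub would be replaced by a real health probe, and the drill would also exercise the physical cutover (power draw, network routes), since those are where the comically simple failures hide.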
Scheduled and unscheduled downtime
Depending on your system and the users it serves, you may have the ability to schedule downtime. Users need to be aware of the times the system may be unavailable for use. If your system is used on weekdays during normal business hours, then scheduling downtime in the evening or on a weekend may not be disruptive. These windows are ideal for installing security or system patches, upgrades, and new hardware. Scheduled downtime is normally not factored into any equation measuring your system’s uptime.
Unscheduled downtime events, however, are the most disruptive and lower your system’s average uptime. They may also impact your organization’s bottom line, not merely through disgruntled users but contractually, by failing to meet minimum uptime requirements in your service-level agreement (SLA) with your customers. Preventing these events is a major focus, of course, but equally important is being able to mitigate quickly by minimizing the length of the downtime itself. Unscheduled downtime typically arises from some physical event, such as a hardware or software failure, or some other anomaly. One example is too many unfinished database requests resulting in long response times, which can be caused by a corrupted or bad index file. In any case, having separate external and internal uptime measurement systems running can catch these situations and alert you to begin mitigating the downtime sooner. Studying your logs afterward can reveal thresholds your system begins to pass through prior to a failure, for which you should alert earlier. Other common causes of unscheduled downtime are power outages, failed components, security breaches, and network connection failures.
Availability is usually measured as a percentage of uptime in a given year. Say a system is 90% available; that may sound great. However, for a system to have 90% availability over 365 days, it may be unavailable for up to 36.5 days of unscheduled downtime per year. That translates to 2.4 hours per day (presuming it should be available 24 hours a day). For a critical application or system this is terrible, and the system will probably not be viewed as viable if it persists. A 99% uptime allows for 3.65 days per year, or 14.4 minutes per day. Similarly, 99.9% uptime allows for 1.44 minutes per day, and 99.99% uptime allows for 8.64 seconds per day. “Five nines” availability is 99.999%, or 864 milliseconds per day! SLAs often specify the monthly downtime or availability a system is promised to deliver to its customers, and usually offer service credits against monthly bills to compensate customers for downtime. Therefore, prevention is very important for the organization, but so is being able to limit the length of any downtime.
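The figures above follow from one calculation: the permitted downtime is simply the unavailable fraction of the measurement period. A small sketch, assuming 24x7 operation over a 365-day year:

```python
def allowed_downtime(availability_pct):
    """Maximum downtime implied by an availability percentage, assuming
    round-the-clock operation over a 365-day year. Returns
    (days per year, seconds per day)."""
    down_fraction = 1.0 - availability_pct / 100.0
    per_year_days = 365 * down_fraction
    per_day_seconds = 24 * 3600 * down_fraction
    return per_year_days, per_day_seconds

for pct in (90.0, 99.0, 99.9, 99.99, 99.999):
    days, secs = allowed_downtime(pct)
    print(f"{pct}% uptime -> {days:.2f} days/year, {secs:.3f} s/day")
```

Running this reproduces the numbers in the text: 90% allows 36.5 days per year (2.4 hours per day), 99% allows 3.65 days per year (14.4 minutes per day), down to 864 milliseconds per day at five nines.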
Availability and Point of View
Availability measurement, and the SLAs that depend on it, need to account for details that can create misunderstanding. If a critical network segment for the entire Northeast US is cut for 2 hours during a weekday morning, yet your system remains available to anyone not on that segment, you can rightly claim your system was not affected. However, if you have important customers in the Northeast US, they may not see it that way. Therefore, having separate third-party monitoring from several nodes in disparate US cities may prove vital and can be accounted for in your SLA. Also, sometimes a system is up and operational overall, yet a key function or feature that users view as important is unavailable. Again, your SLA should account for situations like this and clearly state how and when uptime and downtime are measured and what constitutes each.