NASA Technical Reports Server (NTRS) 19950018573: Software Fault Tolerance in Computer Operating Systems

Chapter 11 in Software Fault Tolerance, Michael Lyu, Ed., Wiley, 1995.
NASA-CR-197999 (N95-24993), Illinois Univ., 30 p.

11 Software Fault Tolerance in Computer Operating Systems

RAVISHANKAR K. IYER and INHWAN LEE
University of Illinois at Urbana-Champaign

ABSTRACT

This chapter provides data and analysis of the dependability and fault tolerance of three operating systems: the Tandem/GUARDIAN fault-tolerant system, the VAX/VMS distributed system, and the IBM/MVS system. Based on measurements from these systems, basic software error characteristics are investigated. Fault tolerance in operating systems resulting from the use of process pairs and recovery routines is evaluated. Two levels of models are developed to analyze error and recovery processes inside an operating system and interactions among multiple instances of an operating system running in a distributed environment. The measurements show that the use of process pairs in Tandem systems, which was originally intended for tolerating hardware faults, allows the system to tolerate about 70% of defects in system software that result in processor failures. The loose coupling between processors, which results in the backup execution (the processor state and the sequence of events occurring) being different from the original execution, is a major reason for the measured software fault tolerance. The IBM/MVS system fault tolerance almost doubles when recovery routines are provided, in comparison to the case in which no recovery routines are available. However, even when recovery routines are provided, there is almost a 50% chance of system failure when critical system jobs are involved.

11.1 INTRODUCTION

The research presented in this chapter evolved from our previous studies on operating system dependability [Hsu87, Lee92, Lee93a, Lee93b, Tan92b, Vel84]. This chapter provides data and analysis of the dependability and fault tolerance of three operating systems: the Tandem/GUARDIAN fault-tolerant system, the VAX/VMS distributed system, and the IBM/MVS system. A study of these three operating systems is interesting because they are widely used and represent the diversity in the field. The Tandem/GUARDIAN and VAX/VMS data provide high-level information on software fault tolerance. The MVS data provide detailed information on low-level error recovery. Our intuitive observation is that GUARDIAN and MVS have a variety of software fault tolerance features, while VMS has little explicit software fault tolerance.

Although an operating system is an indispensable software system, little work has been done on modeling and evaluating the fault tolerance of operating systems. Major approaches for software fault tolerance rely on design diversity [Ran75, Avi84]. However, these approaches are usually inapplicable to large operating systems as a whole due to cost constraints. This chapter illustrates how a fault tolerance analysis of actual software systems, performing analogous functions but having different designs, can be performed based on actual measurements. The chapter shows how software fault tolerance concepts are implemented in operating systems and how well current fault tolerance techniques work. It also brings out relevant design issues in improving the software fault tolerance in operating systems.
The analysis performed illustrates how state-of-the-art mathematical methods can be applied to analyze the fault tolerance of operating systems. Ideally, we would like to measure different systems under identical conditions. The reality, however, is that differences in operating system architectures, instrumentation conditions, measurement periods, and operational environments make this ideal practically impossible. Hence, a direct and detailed comparison between the systems is inappropriate. It is, however, worthwhile to demonstrate the application of modeling and evaluation techniques using measurements on different systems. Also, these are mature operating systems that are slow-changing and have considerable common functionality. Thus, the major results can provide some high-level comparisons that point to the type and nature of relevant dependability issues.

Topics discussed include: 1) investigation of basic error characteristics such as software fault and error profile, time to error (TTE) and time to recovery (TTR) distributions, and error correlations; 2) evaluation of the fault tolerance of operating systems resulting from the use of process pairs and recovery routines; 3) low-level modeling of error detection and recovery in an operating system, illustrated using the IBM/MVS data; and 4) high-level modeling and evaluation of the loss of work in a distributed environment, illustrated using the Tandem/GUARDIAN and VAX/VMS data.

The next section introduces the related research. Section 11.3 explains the systems and measurements. Section 11.4 investigates software fault and error profile, TTE and TTR distributions, and correlated software failures. Section 11.5 evaluates the fault tolerance of operating systems. Section 11.6 builds two levels of models to describe software fault tolerance and performs reward analysis to evaluate software dependability. Section 11.7 concludes the chapter.

11.2 RELATED RESEARCH

Software errors in the development phase have been studied by researchers in the software engineering field [Mus87]. Software error data collected from the DOS/VS operating system during the testing phase was analyzed in [End75]. A wide-ranging analysis of software error data collected during the development phase was reported in [Tha78]. Relationships between the frequency and distribution of errors during software development, the maintenance of the developed software, and a variety of environmental factors were analyzed in [Bas84]. An approach called orthogonal defect classification, which uses observed software defects to provide feedback on the development process, was proposed in [Chi92]. These studies attempt to tune the software development process based on error analysis.

Software reliability modeling has been studied extensively, and a large number of models have been proposed (reviewed in [Goe85, Mus87]). However, modeling and evaluation of fault-tolerant software systems are not well understood, although several researchers have provided analytical models of fault-tolerant software. In [Lap84], an approximate model was derived to account for failures due to design faults; the model was also used to evaluate fault-tolerant software systems. In [Sco87], several reliability models were used to evaluate three software fault tolerance methods.
Recently, more detailed dependability modeling and evaluation of two major software fault tolerance approaches, recovery blocks and N-version programming, were proposed in [Arl90].

Measurement-based analysis of the dependability of operational software has evolved over the past 15 years. An early study proposed a workload-dependent probabilistic model to predict software errors based on measurements from a DEC system [Cas81]. A study of failures and recovery of the MVS/SP operating system running on an IBM 3081 machine addressed the issue of hardware-related software errors [Iye85]. A recent analysis of data from the IBM/MVS system investigated software defects and their impact on system availability [Sul91]. A discussion of issues of software reliability in the system context, including the effect of hardware and management activities on software reliability and failure models, was presented in [Hec86]. Methodologies and advances in experimental analysis of computer system dependability over the past 15 years are reviewed in [Iye93].

11.3 MEASUREMENTS

For this study, measurements were made on three operating systems: the Tandem/GUARDIAN system, the VAX/VMS system, and the IBM/MVS system. Table 11.1 summarizes the measured systems. These systems are representative of the field in that they have varying degrees of fault tolerance embedded in the operating system. The following subsections introduce the three systems and measurements. Details of the measurements and data processing can be found in [Hsu87, Lee92, Lee93b, Tan92b, Vel84].

Table 11.1 Measured systems

HW/SW System    | Architecture | Fault Tolerance          | Workload
Tandem/GUARDIAN | Distributed  | Single-Failure Tolerance | 1) Software Development  2) Customer Applications
IBM 3081/MVS    | Single       | Recovery Management      | System Design/Development
VAXcluster/VMS  | Distributed  | Quorum Algorithm         | 1) Scientific Applications  2) Research Applications

11.3.1 Tandem/GUARDIAN

The Tandem/GUARDIAN system is a message-based multiprocessor system built for on-line transaction processing. High availability is achieved via single-failure tolerance. A critical system function or user application is replicated on two processors as the primary and backup processes, i.e., as process pairs. Normally, only the primary process provides service. The primary sends checkpoints to the backup so that the backup can take over the function on a failure of the primary. A software failure occurs when the GUARDIAN system software detects nonrecoverable errors and asserts a processor halt. The "I'm alive" message protocol allows the other processors to detect the halt and take over the primaries which were executing on the halted processor.
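As a rough illustration of the process-pair mechanism just described, the Python sketch below shows a primary that checkpoints its state and signals that it is alive, and a backup that takes over from the last checkpoint once those signals stop. Every name in it (Primary, Backup, TAKEOVER_TIMEOUT) is invented for the example; this is not GUARDIAN code, only the control flow described above.

    # Illustrative sketch of a process pair with checkpointing and an
    # "I'm alive" protocol.  Names and timeouts are hypothetical.
    import queue
    import threading
    import time

    TAKEOVER_TIMEOUT = 3.0   # seconds of silence after which the backup takes over

    class Primary:
        def __init__(self, checkpoint_q, alive_q):
            self.state = {"requests_served": 0}
            self.checkpoint_q = checkpoint_q
            self.alive_q = alive_q

        def serve_one_request(self):
            self.state["requests_served"] += 1
            # Checkpoint the updated state to the backup, then signal liveness.
            self.checkpoint_q.put(dict(self.state))
            self.alive_q.put(time.time())

    class Backup(threading.Thread):
        def __init__(self, checkpoint_q, alive_q):
            super().__init__(daemon=True)
            self.state = {}
            self.checkpoint_q = checkpoint_q
            self.alive_q = alive_q

        def run(self):
            while True:
                while not self.checkpoint_q.empty():      # absorb pending checkpoints
                    self.state = self.checkpoint_q.get()
                try:
                    self.alive_q.get(timeout=TAKEOVER_TIMEOUT)
                except queue.Empty:
                    # No "I'm alive" message: assume the primary's processor
                    # halted and resume service from the last checkpoint.
                    print("backup taking over with state", self.state)
                    return

    cp_q, alive_q = queue.Queue(), queue.Queue()
    primary, backup = Primary(cp_q, alive_q), Backup(cp_q, alive_q)
    backup.start()
    primary.serve_one_request()          # state is checkpointed to the backup
    time.sleep(TAKEOVER_TIMEOUT + 1)     # primary goes silent; backup takes over
    backup.join()

Because the backup restarts from checkpointed state rather than replaying the primary's exact execution, a software defect that halted the primary is not necessarily re-triggered on takeover; this is the intuition behind the roughly 70% software fault tolerance reported in the abstract.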
A class of faults and errors that cause software failures was collected. Two types of data were used: human-generated software failure reports (used in Section 11.4.1 and Section 11.5.1) and on-line processor halt logs (used in Section 11.4.2, Section 11.4.4, and Section 11.6.1). Human-generated software failure reports provide detailed information about the underlying faults, failure symptoms, and fixes. Processor halt logs provide near-100% reporting and accurate timing information on software failures and recovery.

The source of human-generated software failure reports is the Tandem Product Report (TPR) database. A TPR is used to report all problems, questions, and requests for enhancements by customers or Tandem employees concerning any Tandem product. A TPR consists of a header and a body. The header provides fixed fields for information such as the date, customer and system identifications, and a brief problem description. The body of a TPR is a textual description of all actions taken by Tandem analysts in diagnosing a problem. If a TPR reports a software failure, the body also includes the log of memory dump analyses performed by Tandem analysts. Two hundred TPRs, consisting of all reported software failures in all customer sites during a time period in 1991, were used.

The processor halt log is a subset of the Tandem Maintenance and Diagnostic System (TMDS) event log maintained by the GUARDIAN operating system. Measurements were made on five systems (one field system and four in-house systems) for a total of five system-years. Software failures are rare in the Tandem system, and only one of the in-house systems had enough software failures for a meaningful analysis. This system was a Tandem Cyclone system used by Tandem software developers for a wide range of design and development experiments. It was operating as a beta site and was configured with old hardware. As such, it is not representative of the Tandem system in the field. The measured period was 23 months.

11.3.2 IBM/MVS

MVS is a widely used IBM operating system. Primary features of the system are reported to be efficient storage management and automatic software error recovery. The MVS system attempts to correct software errors using recovery routines. The philosophy in MVS is that, for each major system function, the programmer envisions possible failure scenarios and writes a recovery routine for each. It is, however, the responsibility of the installation (or the user) to write recovery routines for applications. The detection of an error is recorded by an operating system module.

Measurements were made on an IBM 3081 mainframe running the IBM/MVS operating system. The system consisted of dual processors with two multiplexed channel sets. Time-stamped, low-level error and recovery data on errors affecting the operating system functions were collected. During the measurement period, the system was used primarily to provide a time-sharing environment to a group of engineering communities for their daily work on system design and development. Two measurements were made. The measurement periods were 14 months and 12 months. The source of the data was the on-line error log file produced by the IBM/MVS operating system.
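The recovery-routine philosophy just described can be pictured, very loosely, as pairing each major system function with a handler that attempts to repair state and retry before letting the error percolate to the next level of recovery. The Python sketch below is only an analogy under that reading; with_recovery, storage_recovery, and lookup_frame are invented names and do not correspond to actual MVS interfaces.

    # Loose analogy to the MVS recovery-routine philosophy: each major system
    # function has an associated recovery routine that tries to repair state
    # and request a retry; if it declines, the error percolates to the caller.
    # All names here are hypothetical, not MVS services.
    import functools

    def with_recovery(recovery_routine):
        def decorate(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                try:
                    return func(*args, **kwargs)
                except Exception as err:
                    if recovery_routine(err, *args, **kwargs):
                        return func(*args, **kwargs)   # retry after repair
                    raise                              # percolate the error
            return wrapper
        return decorate

    def storage_recovery(err, frame_table, page):
        # Envisioned failure scenario: a damaged page-to-frame mapping.
        print("recovery routine invoked for:", err)
        frame_table[page] = "frame-0"      # restore a safe default entry
        return True                        # ask for a retry

    @with_recovery(storage_recovery)
    def lookup_frame(frame_table, page):
        return frame_table[page]           # raises KeyError if the entry is missing

    print(lookup_frame({}, "page-7"))      # recovery repairs the table, retry succeeds

Only the control flow matters in this analogy: detect the error, run the function-specific recovery routine, retry if the routine reports success, and otherwise pass the error up to the next level of recovery.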
11.3.3 VAX/VMS

A VAXcluster is a distributed computer system consisting of several VAX machines and mass storage controllers connected by the Computer Interconnect (CI) bus organized as a star topology [Kro86]. One of the VAXcluster design goals is to achieve high availability by integrating multiple machines in a single system. The operating system provides the cluster-wide sharing of resources (devices, files, and records) among users. It also coordinates the cluster members and handles recoverable failures in remote nodes via the Quorum algorithm. Each operating system running in the VAXcluster has a parameter called VOTES and a parameter called QUORUM. If there are n machines in the system, each operating system usually sets its QUORUM to ⌊n/2 + 1⌋. The parameter VOTES is dynamically set to the number of machines currently alive in the VAXcluster. The processing of the VAXcluster proceeds only if VOTES is greater than or equal to QUORUM. Thus, the VAXcluster functions like an ⌊n/2 + 1⌋-out-of-n system.
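A worked example of the quorum rule above, written as a small Python check; the function and variable names are illustrative, not VMS interfaces.

    # Processing proceeds only while the number of live members (VOTES) is at
    # least QUORUM = floor(n/2) + 1.
    def quorum(n_members: int) -> int:
        return n_members // 2 + 1

    def cluster_can_proceed(alive: int, n_members: int) -> bool:
        votes = alive                      # VOTES: members currently alive
        return votes >= quorum(n_members)  # proceed only with a quorum

    # For a 7-node cluster, QUORUM = 4: the cluster keeps processing with up
    # to 3 failed members and suspends processing once only 3 remain alive.
    assert quorum(7) == 4
    assert cluster_can_proceed(alive=4, n_members=7)
    assert not cluster_can_proceed(alive=3, n_members=7)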
The two measured VAXclusters had different configurations. The first system, VAX1, was located at the NASA Ames Research Center, a typical scientific application environment. It consisted of seven machines (four 11/785's, one 11/780, one 11/750, and one 8600) and four controllers. The data collection periods for the different machines in VAX1 varied from 8 to 10 months (from October 1987 through August 1988). The second system, VAX2, was located at the University of Illinois, an academic research and student application environment. It consisted of four machines (two 6410's, one 6310, and one 11/750) and one controller. The data collection period was 27 months (from January 1989 through March 1991). The source of the data was the on-line error log file produced by the VAX/VMS operating system.

11.4 BASIC ERROR CHARACTERISTICS

In this section, we investigate basic error characteristics using the measured data. These include fault and error profile, time to error (TTE) and time to recovery (TTR) distributions, and correlated software failures.

11.4.1 Fault and Error Classification

The collection of software faults and errors identified naturally reflects the characteristics of the software development environment. Many studies have attempted to tune the software development process by analyzing the faults identified during the development phase [Tha78, End75, Bas84]. However, fault and error profiles of operational software can be quite different from those of the software during the development phase, due to differences in the operational environment and software maturity. Therefore, it is necessary to investigate the fault and error profiles of operational software. Also, software fault and error categorization for the three measured operating systems is important because they are widely used operating systems. In order to be of value to the community at large, such knowledge should be accumulated in a public domain database that is regularly updated. Results of such categorization can then be used for testing and for designing efficient on-line error detection and recovery strategies, as well as for fault avoidance.

11.4.1.1 GUARDIAN

We studied the underlying causes of 200 Tandem Product Reports (TPRs) consisting of all software failures reported by users for a time period in 1991 [Lee93b]. Twenty-one of the 200 TPRs were due to nonsoftware causes. The underlying causes of these failures indicate that hardware and operational faults sometimes cause failures that look as though they are due to software faults. Our experience shows that determining whether a failure is due to software faults is not always straightforward. This is partly because of the complexity of the system and partly because of close interactions between software and hardware platforms in the system. In 26 out of the remaining 179 TPRs, analysts believed that the underlying problems were software faults but had not yet located the faults. These are referred to as unidentified problems.

Table 11.2 shows the results of a fault classification using the 153 TPRs whose software causes were identified. The table shows both the number of TPRs and the number of unique faults. Differences between the two represent multiple failures due to the same fault. The numbers inside parentheses show a further subdivision of a fault class.

Table 11.2 Software fault classification in GUARDIAN

Fault Class                                                        | #Faults | #TPRs
Incorrect computation                                              | 3       | 3
Data fault                                                         | 12      | 21
Data definition fault                                              | 3       | 7
Missing operation:                                                 | 20      | 27
  - Uninitialized pointer                                          | (6)     | (7)
  - Uninitialized nonpointer variable                              | (4)     | (6)
  - Not updating data structure on the occurrence of an event      | (6)     | (9)
  - Not telling other processes about the occurrence of an event   | (4)     | (5)
Side effect of code update                                         | 4       | 5
Unexpected situation:                                              | 29      | 46
  - Race/timing problem                                            | (14)    | (18)
  - Errors with no defined error handling procedures               | (4)     | (8)
  - Incorrect parameter or invalid call from user process          | (3)     | (7)
  - Not providing routines to handle legitimate but rare operational scenarios | (8) | (13)
Microcode defect                                                   | 4       | 8
Other (cause does not fit any of the above classes)                | 10      | 12
Unable to classify due to insufficient information                 | 15      | 24
All                                                                | 100     | 153

Table 11.2 shows what kinds of faults the developers introduced. In the table, the faults are ordered by the difficulty of testing and identifying them. "Incorrect computation" means an arithmetic overflow or the use of an incorrect arithmetic function (e.g., use of a signed arithmetic function instead of an unsigned one). "Data fault" means the use of an incorrect constant or variable. "Data definition fault" means a fault in declaring data or in defining a data structure. "Missing operation" means that a few lines of source code were omitted. A "side effect" occurs when not all dependencies between software components are considered when updating software. "Unexpected situation" refers to cases in which software designers did not anticipate a potential operational situation and the software does not handle the situation correctly.

Table 11.2 shows that "missing operation" and "unexpected situation" are the most common causes of TPRs. Additional code inspection and testing efforts can be directed to such faults. A high proportion of simple faults, such as incorrect computations or missing operations, is usually observed in new software, while a high proportion of complex causes, such as unexpected situations, is usually observed in mature software. The coexistence of a significant number of simple and complex faults is not surprising, because the measured system is a large software system consisting of both new and mature components. Further, some customer sites run earlier versions of the software, while other sites run later versions. Yet one would like to see fewer simple faults. The existence of a significant proportion of simple faults indicates that there is room for improving the code inspection and testing process.

A software failure due to a newly found fault is referred to as a first occurrence, and a software failure due to a previously reported fault is referred to as a recurrence. Out of the 153 TPRs whose underlying software faults were identified, 100 were due to unique faults. Out of the 100 unique faults, 57 were diagnosed before our measurement period. Therefore, 43 new software faults were identified during the measurement period. That is, about 72% (110 out of 153) of the software failures were recurrences of previously reported faults. Considering that a quick succession of failures at a site, which are likely to be due to the same fault, is typically reported in a single TPR, the actual percentage of recurrences can be higher. This shows that, in environments where a large number of users run the same software, software development is not the only factor that determines the quality of software. Recurrences can seriously degrade software dependability in the field. Clearly, the impact of recurrences on system dependability needs to be modeled and evaluated. The bookkeeping behind the 72% figure is spelled out in the short example below.
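A minimal arithmetic check of that bookkeeping, using only the counts just reported:

    # Recurrence bookkeeping for the 153 TPRs with identified software faults.
    tprs_with_identified_faults = 153
    unique_faults = 100
    faults_known_before_measurement = 57
    new_faults = unique_faults - faults_known_before_measurement     # 43

    # Each new fault accounts for one first occurrence; every other TPR is a
    # recurrence of a previously reported fault.
    first_occurrences = new_faults                                    # 43
    recurrences = tprs_with_identified_faults - first_occurrences     # 110
    print(recurrences / tprs_with_identified_faults)                  # ~0.719, i.e., about 72%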
11.4.1.2 MVS

In MVS, software error data, such as the type of error detection (hardware and software), error symptom, severity, and the results of hardware and software attempts to recover from the problem, are logged by the system. The error symptoms provided by the system were grouped into classes of similar errors. The error classes were chosen to reflect commonly encountered problems. Six classes of errors were defined [Vel84]:

1. Control: indicates the invalid use of control statements and invalid supervisor calls.
2. I/O and data management: indicates a problem occurred during I/O management or during the creation and processing of data sets.
3. Storage management: indicates an error in the storage allocation/deallocation process or in virtual memory mapping.
4. Storage exceptions: indicates addressing of nonexistent or inaccessible memory locations.
5. Programming exceptions: indicates a program error other than a storage exception.
6. Timing: indicates a system- or operator-detected endless loop, endless wait state, or violation of system or user-defined time limits.

Table 11.3 shows the percentage distribution of the errors during the measured period. On the average, the three major error classes are storage management (40%), storage exceptions (21%), and I/O and data management (19%). This result is probably related to the fact that a major feature of MVS is the multiple virtual storage organization. Storage management and I/O and data management are high-volume activities critical to the proper operation of the system. Therefore, one might expect their contributions to errors to be significant.

Table 11.3 Software error classification in MVS (measurement period: 14 months)

Error Type              | Frequency | Fraction (%)
Control                 | 22        | 5.5
Timing                  | 29        | 7.3
I/O and Data Management | 74        | 18.5
Storage Management      | 161       | 40.4
Storage Exceptions      | 82        | 20.6
Programming Exceptions  | 31        | 7.8
All                     | 399       | 100.0

11.4.1.3 VMS

Software errors in a VAXcluster system are identified from "bugcheck" reports in the error log files. All software-detected errors were extracted from bugcheck reports and divided into four types in [Tan92c] (a simple classifier along these lines is sketched after the list):

1. Control: problems involving program flow control or synchronization, for example, "Unexpected system service exception," "Exception while above ASTDEL (Asynchronous System Traps DELivery) or on interrupt stack," and "Spinlock(s) of higher rank already owned by CPU."
2. Memory: problems referring to memory management or usage, for example, "Bad memory deallocation request size or address," "Double deallocation of memory block," "Pagefault with IPL (Interrupt Priority Level) too high," and "Kernel stack not valid."
3. I/O: inconsistent conditions detected by I/O management routines, for example, "Inconsistent I/O database," "RMS (Record Management Service) has detected an invalid condition," "Fatal error detected by VAX port driver," "Invalid lock identification," and "Insufficient nonpaged pool to remaster locks on this system."
4. Others: other software-detected problems, for example, "Machine check while in kernel mode," "Asynchronous write memory failure," and "Software state not saved during powerfail."
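A minimal sketch of such a grouping, assuming each bugcheck message is available as a string; the keyword lists are illustrative and much coarser than the manual classification used in the study.

    # Coarse, keyword-based grouping of bugcheck messages into the four types.
    # The keywords are invented for illustration; the study's classification
    # was done by inspecting the reports.
    CLASSES = {
        "Control": ("exception", "interrupt stack", "spinlock"),
        "Memory":  ("deallocation", "pagefault", "kernel stack"),
        "I/O":     ("i/o database", "rms", "port driver", "lock"),
    }

    def classify_bugcheck(message: str) -> str:
        text = message.lower()
        for error_type, keywords in CLASSES.items():
            if any(keyword in text for keyword in keywords):
                return error_type
        return "Others"

    print(classify_bugcheck("Double deallocation of memory block"))   # Memory
    print(classify_bugcheck("Machine check while in kernel mode"))    # Others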
Table 11.4 shows the frequency of each type of software-detected error for the two VAXcluster systems. Nearly 13% of software-detected errors are of type "Others," and almost all of them belong to VAX2. The VAX2 data showed that most of these errors were "machine check" (i.e., CPU errors). It seemed that the VAX1 error logs did not include CPU errors in the bugcheck category. A careful study of the VAX error logs and discussions with field engineers indicate that different VAX machine models may report the same type of error (in this case, CPU error) to different classes. Thus, it is necessary to distinguish these errors in the error classification. Most "Others" errors were judged to be nonsoftware problems.

Table 11.4 Software error classification in VMS (measurement period: 10 months for VAX1 and 27 months for VAX2)

Error Type | Frequency (VAX1) | Frequency (VAX2) | Combined Fraction (%)
Control    | 71               | 26               | 50.0
Memory     | 8                | 4                | 6.2
I/O        | 16               | 44               | 30.9
Others     | 1                | 24               | 12.9
All        | 96               | 98               | 100.0

11.4.2 Error Distributions

Time to error (TTE) and time to failure (TTF) distributions provide information on error and failure arrivals. Figure 11.1 shows the empirical TTE or TTF distributions fitted to analytic functions for the three measured systems. Here, a failure means a processor failure, not a system failure. An error is defined as a nonstandard condition detected by the system software. Due to the differences in semantics and logging mechanisms between the measured systems, a direct comparison of the distributions is not possible. But we can make high-level observations that point to relevant dependability issues.

None of the distributions in Figure 11.1 fit simple exponential functions. The fitting was tested using the Kolmogorov-Smirnov or Chi-square test at the 5% significance level. This result conforms to the previous measurements on IBM [Iye85] and DEC [Cas81] machines. Several reasons for this nonexponential behavior, including the impact of workload, were documented in [Cas81]. The two-phase hyperexponential distribution provided satisfactory fits for the VAXcluster software TTE and Tandem software TTF distributions. An attempt to fit the MVS TTE distribution to a phase-type exponential distribution led to a large number of stages. As a result, the following multistage gamma distribution was used:

    f(t) = Σ_{i=1}^{n} a_i g(t; α_i, s_i),    where a_i > 0 and Σ_{i=1}^{n} a_i = 1,        (11.1)

and

    g(t; α, s) = 0 for t < s; for t ≥ s, g(t; α, s) is a gamma density in (t − s) with parameter α.        (11.2)

It was found that a 5-stage gamma distribution provided a satisfactory fit.

[Figure 11.1 Empirical software TTE/TTF distributions: (a) IBM MVS software TTE distribution (5-stage gamma fit, t in minutes); (b) VAXcluster software TTE distribution; (c) Tandem software TTF distribution (two-phase hyperexponential fits).]

Figure 11.1b and Figure 11.1c show that the measured software TTE and TTF distributions can be modeled as a probabilistic combination of two exponential random variables, indicating that there are two dominant error modes. The higher error rate, λ2, with occurrence probability α2, captures both the error bursts (multiple errors occurring on the same operating system within a short period of time) and concurrent errors (multiple errors on different instances of an operating system within a short period of time) on these systems. The lower error rate, λ1, with occurrence probability α1, captures regular errors and provides an interburst error rate. Error bursts are also significant in MVS. They are not clearly shown in Figure 11.1a because each error burst was treated as a single situation, called a multiple error.
(The characteristics of multiple errors and their significance are discussed in Section 11.6.2.) The above results show that error bursts need to be taken into account in system design and modeling. The inclusion of error bursts in a model can cause a stiffness problem, which may require improved solution methods. Error bursts, which are near-coincident problems, can affect recovery/retry techniques because additional errors can hit the system while it is recovering from the first error. Hence, design tradeoffs between performing a rapid recovery and a full-scale power-on self-test (POST) need to be investigated. Error bursts can also be repeated occurrences of the same software problem or multiple effects of an intermittent hardware fault on the software. Software error bursts have been observed in laboratory experiments reported in [Bis88]. This study showed that, if the input sequences of the software under investigation are correlated (rather than being independent), one can expect more "bunching" of failures than would be predicted under a constant failure rate assumption. In an operating system, input sequences (user requests) are highly likely to be correlated. Hence, a defect area can be triggered repeatedly.
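To make the two-phase hyperexponential model of Section 11.4.2 concrete, the sketch below evaluates its density and draws samples from it. The parameter values are placeholders chosen for illustration, not the fitted values from Figure 11.1.

    # Two-phase hyperexponential TTE model: with probability alpha_1 an
    # inter-error time comes from an exponential with rate lambda_1 (regular,
    # interburst errors) and with probability alpha_2 = 1 - alpha_1 from a
    # faster exponential with rate lambda_2 (error bursts / concurrent errors).
    import math
    import random

    ALPHA_1, LAMBDA_1 = 0.8, 0.1     # slow mode: regular errors (placeholder values)
    ALPHA_2, LAMBDA_2 = 0.2, 2.0     # fast mode: bursts (placeholder values)

    def density(t: float) -> float:
        return (ALPHA_1 * LAMBDA_1 * math.exp(-LAMBDA_1 * t)
                + ALPHA_2 * LAMBDA_2 * math.exp(-LAMBDA_2 * t))

    def sample_tte() -> float:
        rate = LAMBDA_1 if random.random() < ALPHA_1 else LAMBDA_2
        return random.expovariate(rate)

    # The mixture's coefficient of variation exceeds 1, which is the signature
    # of bursty behavior that a single exponential cannot capture.
    samples = [sample_tte() for _ in range(100000)]
    mean = sum(samples) / len(samples)
    var = sum((x - mean) ** 2 for x in samples) / len(samples)
    print(mean, math.sqrt(var) / mean)

The wide separation between the two rates is also what gives rise to the stiffness issue mentioned above when error bursts are included in analytic models.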
