National Aeronautics and Space Administration

Johnson Space Center’s Risk and Reliability Analysis Group
Analyzing Data for the Decisions of Today and Tomorrow
2008 Annual Report

Information contained in this document has been determined by NASA to be in the public domain. Public domain refers to information that is published and generally accessible or available to the public through various media.

TABLE OF CONTENTS

FOREWORD
INTRODUCTION
    PRA
    R&M Analysis
    Data Collection, Analysis, and Management
2008 ANALYSIS TASKS
    Shuttle
        Shuttle PRA Tasks
        Shuttle R&M Tasks
        Shuttle Data Management Tasks
    Constellation
        Constellation PRA Tasks
        Constellation R&M Tasks
ANALYSIS GROUP STAFF
    JSC S&MA Analysis Branch
    JSC S&MA Support Services Contractors
ACRONYMS

FOREWORD

The Johnson Space Center (JSC) Safety & Mission Assurance (S&MA) Directorate’s Risk and Reliability Analysis Group provides both mathematical and engineering analysis expertise in the areas of Probabilistic Risk Assessment (PRA), Reliability and Maintainability (R&M) analysis, and data collection and analysis. The fundamental goal of this group is to provide National Aeronautics and Space Administration (NASA) decision-makers with the necessary information to make informed decisions when evaluating personnel, flight hardware, and public safety concerns associated with current operating systems as well as with any future systems. The Analysis Group includes a staff of statistical and reliability experts with valuable backgrounds in the statistical, reliability, and engineering fields. This group includes JSC S&MA Analysis Branch personnel as well as S&MA support services contractors, such as Science Applications International Corporation (SAIC) and SoHaR.
The Analysis Group’s experience base includes nuclear power (both commercial and Navy), manufacturing, Department of Defense, chemical, and shipping industries, as well as significant aerospace experience, specifically in the Shuttle, International Space Station (ISS), and Constellation Programs. The Analysis Group partners with project and program offices, other NASA centers, NASA contractors, and universities to provide additional resources or information to the group when performing various analysis tasks.

The JSC S&MA Analysis Group is recognized as a leader in risk and reliability analysis within the NASA community. Therefore, the Analysis Group is in high demand to help the Space Shuttle Program (SSP) continue to fly safely, assist in designing the next-generation spacecraft for the Constellation Program (CxP), and promote advanced analytical techniques. The Analysis Section’s tasks include teaching classes and instituting personnel qualification processes to enhance the professional abilities of our analysts as well as performing major probabilistic assessments used to support flight rationale and help establish program requirements.

During 2008, the Analysis Group performed more than 70 assessments. Although all these assessments were important, some were instrumental in the decision-making processes for the Shuttle and Constellation Programs. Two of the more significant tasks were the Space Transportation System (STS)-122 Low Level Cutoff PRA for the SSP and the Orion Pad Abort One (PA-1) PRA for the CxP. These two activities, along with the numerous other tasks the Analysis Group performed in 2008, are summarized in this report. This report also highlights several ongoing and upcoming efforts to provide crucial statistical and probabilistic assessments, such as the Extravehicular Activity (EVA) PRA for the Hubble Space Telescope servicing mission and the first fully integrated PRAs for the CxP’s Lunar Sortie and ISS missions.

Roger L. Boyer
JSC S&MA Analysis Branch Chief
2101 NASA Parkway, Mail Code NC
Houston, Texas 77058
[email protected]
281.483.6070

INTRODUCTION

The Risk and Reliability Analysis Group was established in 2003 under a general reorganization that formed the S&MA Directorate from the former Safety, Reliability, and Quality Assurance (SR&QA) Directorate at JSC. The Analysis Group is part of the Analysis Branch in the Shuttle and Exploration Division. The figure below shows the Analysis Group’s organizational structure within JSC.

NASA, Johnson Space Center
    Safety and Mission Assurance Directorate
        Shuttle and Exploration Division
            NC4/Analysis Branch Chief: Roger Boyer
                Risk & Reliability Analysis Group
                    Bob Cross – Group Lead and Constellation PRA
                    Mark Bigler – Cx PRA
                    Teri Hamlin – Shuttle PRA
                    Richard Heydorn – Cx R&M
                    Bruce Reistle – Data Lead
                    Henk Roelant – CEV & S/W
                    Michael Stewart – CEV PRA
                    Mark Valentine – Shuttle R&M
                    Scott Winter – Lunar Lander

The NASA Procedural Requirements, NASA Space Flight Program and Project Management Requirements (NPR 7120.5D), March 06, 2007, establishes the requirement for risk management and analysis, more specifically PRA, to be used in all NASA projects and programs. The NASA Policy Directive, NASA Reliability and Maintainability (R&M) Program Policy (NPD 8720.1B), April 29, 2004, establishes a similar requirement for R&M analyses. The Analysis Group assists NASA programs and projects in meeting these obligations to ensure decisions concerning risks are informed, vehicles are safe and reliable, and program/project requirements are realistic and realized.

Probabilistic risk assessment, reliability and maintainability analysis, and data collection enable the Analysis Group to provide crucial reliability and failure information that NASA uses to support many of its safety-related decisions.
PRA

PRA is a comprehensive, structured, and disciplined approach for identifying, analyzing, and quantifying risks in engineered systems. PRA is primarily used as a decision support tool that uncovers design and operational weaknesses in engineered systems and then helps to systematically identify and prioritize safety improvements. PRAs must adequately represent the system design and operation as well as use standard and consistent PRA methods, practices, and applications.

The complexity of a PRA is dependent on the complexity of the system being assessed and the questions to be answered. Complex system assessments require a team of PRA analysts and domain experts working together. The purpose and scope of a PRA drive the selection of PRA methods used for the analysis.

The SSP uses PRA to assess mission risks as an input to its risk-informed decision-making process. The CxP and its project offices are using PRA during the conceptual phase and will continue using PRA throughout the life of the program to evaluate mission, system, element, and subsystem level risks both within and across projects. PRA is also used to perform focused risk studies.

The knowledge gained about the risks to a system may then be used by management to cost-effectively improve the system’s safety and performance in the face of uncertainties by making risk-informed decisions. If a PRA is performed early in the design and development cycle and the engineering and operations communities are actively engaged in performing the PRA, the PRA becomes an effective design tool for verifying risk requirements, performing risk trade studies, and reducing uncertainties.

In general, PRA is a process that seeks answers to three basic questions:
• What kinds of events or scenarios can occur (i.e., what can go wrong)?
• What are the likelihoods and associated uncertainties of the events or scenarios?
• What consequences could result from these events or scenarios (e.g., loss of crew or loss of mission)?

The figure below provides an overview of the PRA process.

[Figure: Overview of the PRA process. Panels, in order: Defining the PRA Study Scope and Objectives; Initiating Events Identification; Event Sequence Diagram (Inductive Logic); Event Tree (ET) Modeling; Fault Tree (FT) System Modeling; Mapping of ET-defined Scenarios to Causal Events; Technical Review of Results and Interpretation. End states shown include OK, loss of mission (LOM), and loss of crew (LOC). Basic event types listed in the fault tree panel: internal initiating events, external initiating events, hardware failure, human error, software error, common cause failure, environmental conditions, and other events. A note in the figure states that the uncertainty in occurrence frequency of an event is characterized by a probability distribution.]

Communicating & Documenting Risk Results and Insights to Decision-maker:
• Displaying the results in tabular and graphical forms
• Ranking of risk scenarios
• Ranking of individual events (e.g., hardware failures, human errors)
• Insights into how various systems interact
• Tabulation of all the assumptions
• Identification of key parameters that greatly influence the results
• Presenting results of sensitivity studies
• Proposing candidate mitigation strategies

The following paragraphs summarize PRA services the Analysis Group provides as well as some of the PRA tools it uses.

Scenario Modeling uses inductive logic and probabilistic tools such as Event Sequence Diagrams (ESDs) and event trees to model each scenario. ESDs help the analysts and the review team identify the failure logic associated with the system or scenarios being developed. Event trees are developed from the ESDs to quantify the failure scenarios.

Failure Modeling uses deductive logic and probabilistic tools called fault trees to model each failure (or its complement, success) for a pivotal event in a failure scenario. Fault trees consist of three parts.
The topmost element (top event) is a given pivotal event defined in a failure scenario. The second part of the fault tree consists of intermediate events that cause the top event. These events are linked through logic gates (i.e., AND gates and OR gates) to the basic events. The basic events are the third part of the fault tree, and their occurrence ultimately causes the top event.

Quantification and Integration is a process that uses an integrated PRA computer program to logically link and quantify the fault trees appearing in the path of each scenario. The frequency of occurrence for each end state in the event tree is the product of the initiating event’s frequency and the (conditional) probabilities of the pivotal events along the scenario path linking the initiating event to the end state. The scenarios are then grouped according to the end state of the scenario defining the consequence. Finally, all like end states are combined (i.e., their frequencies are summed into the frequency of a representative end state).

Uncertainty Analysis is part of the quantification process that evaluates the degree of knowledge or confidence in the calculated numerical risk results. Monte Carlo simulation methods are generally used to perform uncertainty analysis, although other methods exist.

Sensitivity Analysis is frequently performed in a PRA to indicate analysis inputs or elements whose value changes cause the greatest changes in partial or final risk results. Sensitivity analysis identifies system components that, if modified, will have the greatest impact on the overall system risk.

Importance Ranking is a special technique used in some PRA applications to identify the lead, or dominant, contributors to risk in accident sequences or scenarios by listing the lead contributors in decreasing order of importance. This process is generally performed first at the fault tree level and then at the event tree level.
Analysts usually use an integrated PRA computer program to establish the different types of risk importance measures in the importance ranking process.

R&M Analysis

Reliability engineering assesses the probability that a given component or system will operate as designed. Maintainability engineering assesses and verifies the system design characteristics to reduce the need for maintenance and ensure downtime is minimized when maintenance action is necessary. R&M analysis results are used to allocate design resources, focus operations on potential trouble areas, and identify requirements for spares inventories.

Reliability engineering also includes a process called trending, which assesses the reliability performance of systems and components during their missions and identifies changes in reliability performance over time. Through design evaluation, (probabilistic) modeling, analysis, and testing, reliability engineers work to improve the dependability of NASA systems. Reliability analyses are used to support PRA and logistics.

The following paragraphs summarize R&M services the Analysis Group provides as well as some of the R&M tools it uses.

Physics of Failure Analysis identifies the underlying physical processes and mechanisms that cause failure. This analysis helps minimize the risk of failures by enabling analysts and decision-makers to understand the relationship between failures and their driving parameters (environmental, manufacturing process, material defects, etc.). Physics of failure analysis is useful throughout all phases of a program, from technology development and design to operations.

Root Cause Analysis ensures that problems are systematically evaluated and corrected. The key element in a root cause analysis is a good Problem Reporting and Corrective Action (PRACA) System. PRACA is a closed-loop system for documenting hardware and software anomalies, analyzing their impact on R&M, and tracking them to their resolution.
PRACA is a prime data source for program- and project-specific failure histories.

Reliability Assurance Plans identify the activities essential in assuring reliability performance requirements are met during design, production, and product assurance activities. These plans are written during program/project planning and apply throughout the program’s/project’s life. Adherence to these plans ensures that design risks are balanced against the program’s/project’s constraints and objectives.

Reliability Modeling uses prediction, allocation, and modeling tasks to identify inherent reliability characteristics. Reliability modeling aids in evaluating the reliability of competing designs. It is used in design and in operations when failure rates are needed for tradeoff studies, sparing analysis, etc. Reliability modeling results are often used to establish procurement specifications.

Trend Analysis examines past results and evaluates variation in data with the ultimate objective of forecasting future events. Typically, trend analysis is used in the operational phase of a program to provide a means for assessing whether a system or component is in its break-in, operational, or wear-out phase. Trend analysis is also useful in determining if an external factor is affecting a system or component.

Regression Analysis evaluates the relationship between a dependent variable and one or more independent variables and generates an equation to describe the effect of one variable upon another. The most commonly used method for modeling the relationship is least squares, but other methods are available. A least squares fit is usually accompanied by an assessment of “statistical significance,” or the degree of confidence that the true relationship is close to the estimated relationship. Once the relationship, or model, between the variables is obtained, the model can be used to further investigate the root cause or to predict the value of the dependent variable.
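As an illustration of the least squares fit described under Regression Analysis, the following Python sketch fits a straight line to paired observations and reports a t statistic for the slope as one common measure of statistical significance. The function name `least_squares_fit` and the sample data are hypothetical; they are not taken from any Analysis Group tool or data set.

```python
import math


def least_squares_fit(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns (slope, intercept, t_stat), where t_stat is the slope divided
    by its standard error.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least three paired observations")

    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x values must not all be identical")

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual sum of squares, used to estimate the scatter about the line.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    t_stat = slope / se_slope if se_slope > 0 else math.inf
    return slope, intercept, t_stat


# Hypothetical observations (illustrative only, not program data).
operating_cycles = [1.0, 2.0, 3.0, 4.0, 5.0]
measured_wear = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept, t_stat = least_squares_fit(operating_cycles, measured_wear)
print(f"wear = {intercept:.2f} + {slope:.2f} * cycles (t = {t_stat:.1f})")
```

A large absolute t statistic suggests that the slope is distinguishable from zero. A formal significance level would require comparing it against a t distribution with n - 2 degrees of freedom, which is omitted here to keep the sketch within the standard library.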
Reliability Growth is the improvement in a reliability parameter over a period of time due to changes in product design or the manufacturing process. It occurs by surfacing failure modes and implementing effective corrective actions. Reliability growth management involves systematically planning for reliability achievement as a function of time and other resources, and controlling the ongoing rate of achievement by reallocating these resources based on comparisons between planned and assessed reliability values.

Weibull Analysis matches historical failure and repair data to appropriate Weibull distributions. These distributions represent the failure or repair characteristics of a given failure mode and may be assigned to failure models that are attached to blocks in a reliability block diagram or events in a fault tree diagram. Weibull analysis results are typically given by two parameters that describe the distribution curve. These parameters are β, the shape parameter, and η, the scale parameter (characteristic life). β is useful in determining the failure characteristics of the data. If the failure rate is increasing, then β is greater than 1; if the failure rate is decreasing, then β is less than 1; and if the failure rate is constant, then β equals 1.

Simulation is a problem-solving technique that approximates the probability of certain outcomes by running multiple trials, called simulations, using random variables. Simulation is often used when the system to be modeled is too complex to develop a closed-form mathematical solution for the reliability problem. The three classic types of reliability simulators are Monte Carlo, reliability block diagram, and queuing.

Data Collection, Analysis, and Management

Data is the essential component (the life blood) of PRA and R&M analysis. Various types of data must be collected and processed for use throughout the PRA process and in R&M analyses.
NASA gathers data from a variety of sources within and outside of NASA. One of the primary data sources for each NASA program is its PRACA System. The PRACA Database records typically provide the failure data for the program. External data sources may include the Reliability Analysis Center Automated Databook’s Nonelectronic Parts Reliability Data and Electronic Parts Reliability Data, the National Transportation Safety Board, the Nuclear Computerized Library for Assessing Reactor Reliability, and other sources. The Analysis Group analyzes both internal and external data for the problem, system, or component under study to support PRAs and R&M analyses. Once data is collected for a particular study, the Analysis Group stores or maintains the data so it can be retrieved and referenced for future purposes.

Data collection and analysis proceed in parallel, or in conjunction, with PRA and R&M analysis. Data is assembled to quantify accident scenarios and contributors. Data includes, but is not limited to, component failure rates, repair times, initiating event probabilities, structural failure probabilities, human error probabilities, process failure probabilities, and common cause failure probabilities. Uncertainty bounds and uncertainty distributions are also collected and developed in the data collection process.
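To show how such data feed the quantification step described earlier, the sketch below multiplies an initiating event frequency by the conditional probabilities of the pivotal events along each scenario path, sums the scenarios that share an end state, and uses Monte Carlo sampling to propagate uncertainty. The lognormal distributions, the error-factor parameterization, all function names, and every numerical value are assumptions made for illustration; none of them comes from a Shuttle or Constellation model.

```python
import math
import random


def scenario_frequency(initiating_event_freq, pivotal_probs):
    """Initiating event frequency times each conditional pivotal-event probability."""
    freq = initiating_event_freq
    for p in pivotal_probs:
        freq *= p
    return freq


def sample_lognormal(median, error_factor, rng):
    """Draw one value from a lognormal distribution (an assumed choice).

    error_factor is taken to be the ratio of the 95th percentile to the median.
    """
    sigma = math.log(error_factor) / 1.645
    return rng.lognormvariate(math.log(median), sigma)


def end_state_uncertainty(scenarios, trials=10000, seed=1):
    """Monte Carlo estimate of the summed frequency of one end state.

    scenarios: list of (initiating_event, pivotal_events), where
    initiating_event is (median, error_factor) and pivotal_events is a
    list of (median, error_factor) pairs.
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(trials):
        total = 0.0
        for initiating_event, pivotal_events in scenarios:
            ie_median, ie_ef = initiating_event
            ie_freq = sample_lognormal(ie_median, ie_ef, rng)
            # Cap sampled probabilities at 1.0, since a lognormal is unbounded above.
            probs = [min(1.0, sample_lognormal(m, ef, rng)) for m, ef in pivotal_events]
            total += scenario_frequency(ie_freq, probs)
        totals.append(total)
    totals.sort()
    return {
        "mean": sum(totals) / trials,
        "p05": totals[int(0.05 * trials)],
        "p50": totals[int(0.50 * trials)],
        "p95": totals[int(0.95 * trials)],
    }


# Two hypothetical scenarios that both end in the same end state.
example_scenarios = [
    ((1e-2, 3.0), [(1e-1, 3.0), (5e-2, 5.0)]),
    ((1e-3, 10.0), [(2e-1, 2.0)]),
]
result = end_state_uncertainty(example_scenarios)
print({name: f"{value:.2e}" for name, value in result.items()})
```

Sampling each input independently is a simplification. An integrated PRA tool would typically also account for correlated inputs and shared basic events across scenarios.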