3 Reliability and Human Error in Systems
It is evident that human reliability is too often (some would argue always) the weak link in the chain of events which leads to loss of safety and catastrophic failure of even the most advanced technological systems.
—B.A. Sayers (1988)
INTRODUCTION
We all make mistakes, and sometimes those mistakes lead to a system failure. For example, while driving, it is easy to get distracted. You might be talking to a friend on a cell phone or trying to insert a disk into the CD player and so fail to see that the car in front of you has stopped. You might just crash into the car in front without braking, or you might not have time enough to complete the braking action to prevent a crash. Allowing your attention to be drawn away from the road, for whatever reason, was an error, and errors often have adverse consequences.
6In July 2001, 6-year-old Michael Colombini was brought to the Westchester Medical Center, New York, for a magnetic resonance imaging (MRI) exam following surgery. An MRI machine consists of a very large and powerful magnet. Michael was placed in the middle of this magnet for his scan. Regulations about what can and cannot be brought into MRI exam rooms are very explicit; even paperclips are not allowed because they will be drawn into the center of the magnet at high speed when the machine is turned on. Nonetheless, someone brought an oxygen tank into the exam room. The heavy tank was drawn into the center of the magnet, and Michael died of blunt-force trauma to the head.
7The United States enjoys one of the most efficient, effective, and safe medical systems in the world. When medical errors occur, like the one that cost Michael Colombini his life, they are usually human errors. We have all heard stories about botched surgeries, disastrous drug interactions and overdoses, poor treatment decisions, and mishandling of medical equipment, and these stories can overshadow the fact that citizens in the United States receive almost the best medical care in the world. Still, in 2000, the number of deaths in the United States resulting from medical errors was estimated to be anywhere between 44,000 and 100,000 people (Kohn et al., 2000).
In that year, the U.S. President launched a series of initiatives to boost patient safety, with the goal of reducing preventable medical errors by 50% by 2005. Among the initiatives were a requirement that all hospitals participating in the government's Medicare program institute error-reduction programs and support for research on medical errors. This was accompanied by calls for the medical profession to implement a human factors approach to human error similar to that employed by the aviation industry (Crew Resource Management or CRM; see, e.g., Leape, 1994) and to integrate information about the sources of human error into the medical curriculum (Glavin & Maran, 2003). Despite the increased awareness, since 2000, of the importance of reducing medical error, Leape and Berwick (2005, p. 2385) concluded that progress has been "frustratingly slow." Deaths from medical errors were reduced only slightly by 2005, nowhere close to the 50% goal.
12As we discussed in Chapter 1, systems can be small (like the lighting system) or large (like the many people, equipment, policies, and procedures that compose a U.S. hospital). Within each system we can identify one or more operators, people in charge of using a machine, implementing a policy, or performing a task, who help guide the system toward achieving its goal. A primary mission of the human factors specialist is to minimize human error and so to maximize system performance. This requires the specialist to identify the tasks performed by the operator and determine possible sources of error. This information must then be incorporated into the design of the system, if performance is to be optimized. Before considering ways that the likelihood of human error can be evaluated, we must first consider the system concept and its role in human factors.
CENTRAL CONCEPT IN HUMAN FACTORS: THE SYSTEM
A human–machine system is a system in which an interaction occurs between people and other system components, such as hardware, software, tasks, environments, and work structures. The system may be simple, such as a human interacting with a tool, or it may be complex, such as a flexible manufacturing system (Czaja & Nair, 2006, p. 32).
15A system operates for the purpose of achieving a goal. A hospital operates to cure disease and repair injury. An automobile operates to move people from one place to another. As human factors specialists, we believe that the application of behavioral principles to the design of systems will lead to improved functioning of those systems and will increase our abilities to achieve our goals. Indeed, the U.S. National Academy of Engineering has indicated the importance of the systems approach to engineering in general, stating, ‘‘Contemporary challenges—from biomedical devices to complex manufacturing designs to large systems of networked devices—increasingly require a systems perspective’’ (2005, p. 10).
16The system approach has its basis in systems engineering, a multidisciplinary approach to design that emphasizes the overall goals of the system or product under development during the design process (Kossiakoff & Sweet, 2003). Beginning with the identification of an operational need, designers determine the requirements of the system, which in turn results in a system concept. Designers implement this concept in a system architecture, dividing the system into optimized subsystems and components. For example, a hospital might plan a cancer research center, including state-of-the-art diagnostic devices (like an MRI machine), treatment facilities, counseling, and hospice care. Each of these separate components of the cancer center can be treated as a subsystem, tested and optimized, and then integrated into the overall system, which in turn is evaluated and tested. The result is the final cancer research center, which, hopefully, operates at peak performance.
Systems engineering (as well as systems management) does not focus on the human component of the system (Booher, 2003b). This is the domain of the human factors specialist, who is concerned with optimizing the performance of the human subsystems, primarily through the design of human–machine interfaces, training materials, and so forth that promote effective human use. System analyses applied to the human component provide the basis for evaluating reliability and error, as well as for the design recommendations intended to minimize errors. They also provide the basis for safety assessment of existing technological systems such as nuclear power plants (Cacciabue, 1997).
IMPLICATIONS OF THE SYSTEM CONCEPT
Several implications of the system concept are important for evaluating human reliability and error (e.g., Bailey, 1996). These include the operator, the goals and structure of the system, its inputs and outputs, and the larger environment in which it is placed.
The operator is part of a human–machine system. We must evaluate human performance in applied settings in terms of the whole human–machine system. That is, we must consider the specific system performing in the operational environment and study human performance in relation to the system.
22The system goals take precedence over everything else. Systems are developed to achieve certain goals. If these goals are not achieved, the system has failed. Therefore, evaluations of all aspects of a system, including human performance, must occur with respect to the system goals. The objective of the design process is to satisfy the system goals in the best way possible.
23Systems are hierarchical. A system can be broken down into smaller subsystems, which in turn can be broken down into components, subcomponents, and parts. Higher levels in the system hierarchy represent system functions (i.e., what the system or subsystem is to accomplish), whereas lower levels represent specific physical components or parts. A human–machine system can be broken into human and machine subsystems, and the human subsystem can be characterized as having subgoals that must be satisfied for the overriding system goals to be met. In this case, the components and parts represent the strategies and elementary mental and physical acts required to perform certain tasks. We can construct a hierarchy of goals and subsystems by considering components within both the human and machine subsystems. Consequently, we can evaluate each subsystem relative to a specific subgoal, as well as to the higher-level goals within the system.
24Systems and their components have inputs and outputs. We can identify the inputs and outputs of each subsystem. The human factors specialist is particularly concerned with the input to the human from the machine and the actions that the human performs on the machine. Because the human subsystem can be broken down into its constituent subprocesses, we are also interested in the nature of the inputs and outputs from these subprocesses and how errors can occur.
A system has structure. The components of a system are organized and structured in a way that achieves a goal. This structure provides the system with its own special properties. In other words, the whole operating system has properties that emerge from those of its parts. By analyzing the performance of each component within the context of the system structure, the performance of the overall system can be controlled, predicted, and improved. To emphasize the emergent properties of a whole complex system, advocates of an approach called cognitive work analysis like to conceive of the entire system, humans and machines alike, as a single intelligent cognitive system rather than as separate human and machine subsystems (Sanderson, 2003).
26Deficiencies in system performance are due to inadequacies of system components. The total performance of a system is determined by the nature of the system components and their interactions with each other. Consequently, if the system design is appropriate for achieving certain goals, we must attribute system failures to the failure of one or more system components.
27A system operates within a larger environment. The system itself cannot be understood without reference to the larger physical and social environment in which it is embedded. If we fail to consider this environment in system design and evaluation, we will make an inadequate assessment of the system. Although it is easy to say that there is a distinction between the system and its environment, the boundary between them is not always clearly defined, just as the boundaries between subsystems are not always clearly defined. For example, a data management expert works at a computer workstation in the immediate environment of his or her office, but this office resides in the environment created by the policies and guidelines mandated by his or her employer.
SYSTEM VARIABLES
A system consists of all the machinery, procedures, and operators carrying out those procedures, which work to fulfill the system goal. There are two kinds of systems: mission oriented and service oriented (Meister, 1991). Mission-oriented systems subordinate the needs of their personnel to the goal of the mission. These systems, like weapon and transport systems, are common in the military. Service-oriented systems cater to personnel, clients, or users. Such systems include supermarkets and offices.
Most systems fall between the extremes of mission and service orientations and involve components of both. For example, an automobile assembly plant has a mission component, that is, the goal of building a functional vehicle. However, it also has a service component in that the vehicle is being built for a consumer. Furthermore, assembly line workers, whose welfare is of concern to the system designers, build the vehicle. The company must service these workers to fulfill its mission to build automobiles.
32The variables that define a system’s properties, such as the size, speed, and complexity of the system, in part determine the requirements of the operator necessary for efficient operation of the system. Following Meister (1989), we can talk about two types of system variables. One type describes the functioning of the physical system and its components, whereas the other type describes the performance of individual and team operators. Table 3.1 lists some variables of each type.
Physical system variables. Physical systems are distinguished by their organization and complexity. Complexity is a function of the number and arrangement of subsystems. How many subsystems operate at any one time, which subsystems receive inputs from and direct outputs to the other subsystems, and the ways that the subsystems or components are connected, all contribute to system complexity.
The organization and complexity of the system determine interdependencies among subsystems. Subsystems that depend on others for their input and those that must make use of a common resource pool to operate are interdependent. For interdependent subsystems, the operation of one subsystem directly influences the operation of another because it provides inputs and uses resources required by another subsystem.
TABLE 3.1
System Variables Identified by Meister

Physical system variables
1. Number of subsystems
2. Complexity and organization of the system
3. Number and type of interdependencies within the system
4. Nature and availability of required resources
5. Functions and tasks performed by the system
6. Requirements imposed on the system
7. Number and specificity of goals
8. Nature of system output
9. Number and nature of information feedback mechanisms
10. System attributes—for example, determinate/indeterminate, sensitive/insensitive
11. Nature of the operational environment in which the system functions

Operator variables
1. Functions and tasks performed
2. Personnel aptitude for tasks performed
3. Amount and appropriateness of training
4. Amount of personnel experience and skill
5. Presence or absence of reward and motivation
6. Fatigue or stress condition
7. The physical environment for individual or team functioning
8. Requirements imposed on the individual or team
9. Size of the team
10. Number and type of interdependencies within the team
11. The relationship between individual/team and other subsystems

Source: From Meister (1989).

An important characteristic of a system has to do with feedback. Feedback refers to input or information flow traveling backward in the system. Different systems may have different kinds of feedback mechanisms and often have more than one. Feedback usually provides information about the difference between the actual and desired state of the system. Positive feedback is added to the system input and keeps the state of the system changing in its present direction. Such systems are usually unstable, because positive information flow can amplify error instead of correcting it. The alternative to positive feedback is negative feedback, which is subtracted from the system input. It is often beneficial for a system to include negative feedback mechanisms.
Suppose a system's goal is to produce premixed concrete. A certain amount of concrete requires some amount of water for mixing. If too much water is added, sand can be introduced to the mixture to dry it. A negative feedback loop would monitor the water content of the mixture, and this information would be used to direct the addition of more water or more sand until the appropriate mix had been achieved.
Systems that make use of feedback are called closed-loop systems (see Figure 3.1a). In contrast, systems that do not use feedback are referred to as open-loop systems (see Figure 3.1b). Closed-loop systems that use negative feedback are error correcting because the output is continuously monitored. In contrast, open-loop systems have no such error detection mechanisms. In complex systems, there may be many feedback loops at different hierarchical levels of the system.
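To make the closed-loop idea concrete, here is a minimal sketch of a negative-feedback controller for the concrete-mixing example above. The target water content, step sizes, and the simulated batch are hypothetical values chosen only for illustration, not data from the text.

```python
def mix_concrete(measure_water_content, add_water, add_sand,
                 target=0.18, tolerance=0.01, max_cycles=100):
    """Minimal closed-loop (negative feedback) controller: monitor the
    water content of the mix and correct deviations until the mix is
    within tolerance of the target."""
    for _ in range(max_cycles):
        error = measure_water_content() - target   # feedback signal
        if abs(error) <= tolerance:
            return True              # desired state reached
        if error < 0:
            add_water()              # mix too dry: add water
        else:
            add_sand()               # mix too wet: add sand to dry it
    return False


# Toy usage: a simulated batch that starts out too wet.
state = {"water": 0.25}
done = mix_concrete(lambda: state["water"],
                    lambda: state.update(water=state["water"] + 0.005),
                    lambda: state.update(water=state["water"] - 0.005))
print(done, round(state["water"], 3))
```

Removing the measurement-and-correction loop (the error check) would turn this into an open-loop procedure that adds a fixed amount of water and never verifies the result.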
64The goals, functions, organization, and complexity of a system determine its attributes. As one example, a system can be relatively sensitive or insensitive to deviations in inputs and outputs. A small change in airflow probably will not affect the systems in a typical office building, but it might be devastating for the systems in a chemical processing plant. Also, systems can be determinate or indeterminate. Determinate systems are highly proceduralized. Operators follow specific protocols and have little flexibility in their actions. Indeterminate systems are not as highly proceduralized, and there is a wide range of activities in which the operators can engage. Also, in indeterminate systems, the operator’s response might be based on ambiguous inputs, with little feedback.
65Finally, systems operate in environments that may be friendly or unfriendly. Adverse conditions such as heat, wind, and sand take their toll on system components. For the system to operate effectively, the components must be able to withstand the environmental conditions around them.
66Operator variables. The requirements for system operators depend on the functions and tasks that must be performed for effective operation of the system. To perform these tasks, operators must meet certain aptitude and training requirements. For example, fighter pilots are selected according to aptitude profiles and physical characteristics, and they also must receive extensive training.
FIGURE 3.1 (a) Closed- and (b) open-loop systems. (In both panels, input flows through an interface, central processor, and control mechanism to produce output in the environment; the closed-loop system also feeds environmental change back to the input to produce the desired change.)
82 Performance is also affected by motivation, fatigue, and stress. Depending on the levels of these factors, a person’s performance can vary from good to bad. Consider the problem of medical errors. Before July 2003, hospital residents routinely worked more than 80 h per week in shifts that frequently exceeded 30 h (see Lamberg, 2002). Because the guidelines differed at different hospitals, the true workload was unknown. There are many reasons why such demanding schedules are required of doctors in training, one being that the long duty shifts allow young doctors to observe the course of an illness or trauma from start to finish.
83However, sleep deprivation contributes to medical errors (Landrigan et al., 2004). A number of consumer advocacy and physicians’ groups petitioned the U.S. Occupational Health and Safety Administration (OSHA) to establish national limits on work hours for resident physicians in the early 2000s. Facing pending legislation, the U.S. Accreditation Council for Graduate Medical Education (ACGME) set new rules, which took effect in July 2003, that restrict the number of resident work hours to no more than 80 a week, in shifts no longer than 30 h, and provide at least one day off in seven.
84In the year after the ACGME’s new duty-hour standards, however, over 80% of interns reported that their hospitals were not complying with the standards, and they had been obliged to work more than 80 h per week. Over 67% of interns reported that their shifts had been longer than 30 h, and these violations had occurred during one or more months over the year (Landrigan et al., 2006). It is still difficult to determine the extent to which teaching hospitals are complying with the new standards: residents are reluctant to report violations, and residents often choose to work longer hours than they are scheduled to care for their patients (Gajilan, 2006). Even under the new standards, however, residents still make many more mistakes at the end of an extended shift, such as stabbing themselves with needles or cutting themselves with scalpels (Ayas et al., 2006), and their risk of being involved in auto accidents when traveling home after their shift is greatly increased (Barger et al., 2005).
85The demands placed on an individual will vary across different physical environments even in the absence of complicating factors such as stress and fatigue. Variables like temperature, humidity, noise level, illumination, and so on may exert their effects through increasing stress and fatigue (see Chapter 17). Also, when several people must work together to operate a system, team factors become important. The size of the team and the interrelations among the various team members influence the efficiency with which the team operates and, hence, the efficiency with which the system operates.
SUMMARY
The system concept, as developed in the field of systems engineering, is fundamental to the discipline of human factors. We must think about a system in terms of both its physical and mechanical variables and its individual and team operator variables. We must evaluate the performance of the operators with respect to the functioning of the entire system. The assumptions and implications of the system concept dictate the way researchers and designers approach applied problems. The system concept is the basis for reliability analysis, which we consider later in the chapter, as well as for the information-processing approach to human performance, which is discussed in Chapter 4.
HUMAN ERROR
A human error occurs when an action is taken that was "not intended by the actor; not desired by a set of rules or an external observer; or that led the task or system outside its acceptable limits" (Senders & Moray, 1991, p. 25). Therefore, we see that whether an action is considered to be an error is determined by the goals of the operator and of the system. In some situations a slow or sloppy control action may not qualify as error, but in other situations it might.
91 For example, in normal flight, displacements of an aircraft a few meters above or below an intended altitude are not crucial and would not be classified as errors. However, in stunt flying, and when several planes are flying in formation, slight deviations in altitude and timing can be fatal. In 1988 at a U.S. air base in Ramstein, West Germany, three Italian Air Force jets collided in one of the worst-ever air show disasters. A total of 70 people, including the three pilots, were killed when one of the planes collided with two others and crashed into a crowd of spectators. The collision occurred when the jets were executing a maneuver in which one jet was to cross immediately above five jets flying in formation. A former member of the flying team concluded, ‘‘Either the soloist was too low or the group was too high. . . . In these situations a difference of a meter can upset calculations. . . . [This deviation could have been caused by] a sudden turbulence, illness or so many other things’’ (1988, UPI wire story). In this case, the system failed because of a very slight altitude error.
92The principal consideration of the human factors specialist is with system malfunctions that involve the operator. Although we typically refer to such errors as human errors, they frequently are attributable to the design of the human–machine interface or the training provided to the operator or both (Peters & Peters, 2006). Thus, the failure of a technological system often begins with its design. The system design can put the user in situations for which success cannot be expected. We restrict the term operator error to refer to those system failures that are due entirely to the human and the term design error to refer to those human errors that are due to the system design.
WHY HUMAN ERROR OCCURS
94There are several viewpoints about what causes human error (Wiegmann & Shappell, 2001). From one perspective, human error can be traced to inadequacies of the system design: Because the system involves humans and machines operating within a work environment, the human is rarely the sole cause of an error. Inadequacies of system design fall into three groups (Park, 1987): task complexity, error-likely situations, and individual differences.
95Task complexity becomes an issue when task requirements exceed human capacity limits. As we see in later chapters, people have limited capacities for perceiving, attending, remembering, calculating, and so on. Errors are likely to occur when the task requirements exceed these basic capacity limitations. An error-likely situation is a general situational characteristic that predisposes people to make errors. It includes factors such as inadequate workspace, inadequate training procedures, and poor supervision. Finally, individual differences, which we talked about in Chapter 2, are the attributes of a person, such as abilities and attitudes, which in part determine how well he or she can perform the task. Some important individual differences are susceptibility to stress and inexperience, which can produce as much as a tenfold increase in human error probability (Miller & Swain, 1987).
96A second view about the causes of error is oriented around the cognitive processing required to perform a task (O’Hare et al., 1994). One assumption of cognitive models (described more fully in Chapter 4) is that, in the brain, information progresses through a series of processing stages from perception to initiation and control of action. Errors occur when one or more of these intervening processes produce an incorrect output. For example, if a person misperceives a display indicator, the bad information will propagate through the person’s cognitive system and lead to actions that result in an error because decisions are based on this information.
A third view of human errors, popular within the context of aviation, borrows from an aeromedical perspective (Raymond & Moser, 1995), that is, one involving the medical aspects of physiological and psychological disorders associated with flight. From this view, errors can be attributed to an underlying physiological condition. This approach emphasizes the role of physiological status in affecting human performance. This perspective has been responsible for much of the attention devoted to the factors of fatigue and emotional stress, and how they are influenced by work schedules and shift rotations.
99 Two final views emphasize group interactions and their effects on human error with an emphasis on psychosocial and organizational perspectives (Dekker, 2005; Perrow, 1999). The psychosocial perspective looks at performance as a function of interactions among many people. This is particularly relevant for commercial aviation, where there are several members of the flight crew, each with different duties, air traffic controllers with whom they must communicate, and flight attendants who interact with both passengers and crew. In addition, ground crews supervise loading and fueling of the aircraft, and maintenance personnel work on maintaining the aircraft in good condition. The psychosocial perspective emphasizes that errors occur when communications among the group members break down.
100The organizational perspective, which emphasizes the roles played by managers, supervisors, and other people in an organizational hierarchy, is important in industrial settings. The risky, and ultimately fatal, decision to launch Challenger on a cold morning, despite concern expressed by engineers that the O-rings would not properly seal the joints at low temperatures, is one of the most well-known incidents in which social and organizational dynamics were significant contributors to a disaster.
ERROR TAXONOMIES
102It is useful to discuss human error with a taxonomy or scheme for categorizing different kinds of errors. There are many useful error taxonomies (see Stanton, 2006a). Some refer to the type of action taken or not taken, others to particular operational procedures, and still others to the errors’ locations in the human information processing system. We describe the taxonomies of action, failure, processing, and intentional classification, and the circumstances under which each is most appropriate.
103Action classification. Some errors can be traced directly to an operator’s action or inaction (Meister & Rabideau, 1965). An error of omission is made when the operator fails to perform a required action. For example, a worker in a chemical waste disposal plant may omit the step of opening a valve in the response sequence to a specific emergency. This omission might be in relation to a single task (failing to open the valve) within a more complicated procedure, or an entire procedure (failing to respond to an emergency). An error of commission occurs when an action is performed, but it is inappropriate. In this case, the worker may close the valve instead of opening it.
We can further subdivide commission errors into timing errors, sequence errors, selection errors, and quantitative errors. A timing error occurs when a person performs an action too early or too late (e.g., the worker opened the valve but too late for it to do any good). A sequence error occurs when the worker performs the steps in the wrong order (e.g., the worker opened the valve but before waste had been diverted to that valve). A selection error occurs when the worker manipulates the wrong control (e.g., the worker opened a valve but it was the wrong one). Finally, a quantitative error occurs when the worker makes too little or too much of the appropriate control manipulation (e.g., the worker opened the valve but not wide enough).
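Because the action taxonomy is a simple classification, it maps directly onto a small data structure. The sketch below is only an illustrative encoding of the categories and the valve examples from the text; the enum and field names are ours, not part of any standard incident-reporting tool.

```python
from enum import Enum, auto
from dataclasses import dataclass

class ActionError(Enum):
    OMISSION = auto()       # required action not performed
    TIMING = auto()         # performed too early or too late
    SEQUENCE = auto()       # steps performed in the wrong order
    SELECTION = auto()      # wrong control manipulated
    QUANTITATIVE = auto()   # too little or too much of the manipulation

@dataclass
class ObservedAction:
    description: str
    error: ActionError

# The valve examples from the text, encoded as an incident log.
incident_log = [
    ObservedAction("valve opened too late to do any good", ActionError.TIMING),
    ObservedAction("valve opened before waste was diverted to it", ActionError.SEQUENCE),
    ObservedAction("wrong valve opened", ActionError.SELECTION),
    ObservedAction("valve not opened wide enough", ActionError.QUANTITATIVE),
]
```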
Failure classification. An error may or may not lead to a system failure. This is the distinction between recoverable and nonrecoverable errors. Recoverable errors are ones that can potentially be corrected and their consequences minimized. In contrast, nonrecoverable errors are those for which system failure is inescapable. Human errors are most serious when they are nonrecoverable; for recoverable errors, we must design system features that provide feedback to operators to make them aware of their errors and the actions operators should take to recover the system.
Human-initiated system failures can arise because of operating, design, assembly, or installation/maintenance errors (Meister, 1971). An operating error occurs when a machine is not operated according to the correct procedure. A design error can occur when the system designer creates an error-likely situation by failing to consider human tendencies or limitations. An assembly or manufacturing error arises when a product is misassembled or faulty, and an installation or maintenance error occurs when machines are either installed or maintained improperly.
In 1989, British Midland flight 092 from London to Belfast reported vibration and a smell of fire in the cockpit—signs of an engine malfunction. Although the malfunction occurred in the Number 1 engine, the crew throttled back the Number 2 engine and tried to make an emergency landing using only the malfunctioning engine. During the landing approach, the Number 1 engine lost power, resulting in a crash 900 m short of the runway. Forty-seven of the 126 passengers and crew were killed. At first, investigators speculated that there had been a maintenance error: possibly the fire-warning panel of the Boeing aircraft was miswired to indicate that the wrong engine was on fire. The investigation revealed that in fact the panel had not been miswired. However, the tragedy led to inspections of other Boeing aircraft, which revealed 78 instances on 74 aircraft of miswiring in the systems designed to indicate and extinguish fires (Fitzgerald, 1989). To avoid future wiring errors during assembly and maintenance, Boeing redesigned the panel wiring connectors so that each would be a unique size and miswiring would be impossible.
Processing classification. We can also classify errors according to their locus within the human information processing system (Payne & Altman, 1962). Input errors are those attributable to sensory and perceptual processes. Mediation errors reflect the cognitive processes that translate between perception and action. Finally, output errors are those that are due to the selection and execution of physical responses.
110Table 3.2 shows a more detailed processing classification (Berliner et al., 1964). In addition to distinguishing input (perceptual), mediation, and output (motor) errors, this classification recognizes another type of error, referred to as communication error. These communication errors reflect failures of team members to accurately transmit information to other members of the team. Table 3.2 also lists specific behaviors for which errors of each type can occur. In subsequent chapters, we elaborate the details of the human information processing system and describe in more detail the specific sources of errors within it.
TABLE 3.2
Berliner et al.'s Processing Classification of Tasks

Perceptual processes
  Searching for and receiving information: Detect; inspect; observe; read; receive; scan; survey
  Identifying objects, actions, and events: Discriminate; identify; locate

Mediational processes
  Information processing: Calculate; categorize; compute; encode; interpolate; itemize; tabulate; transfer
  Problem solving and decision making: Analyze; choose; compare; estimate; predict; plan

Communication processes
  Advise; answer; communicate; direct; indicate; inform; instruct; request; transmit

Motor processes
  Simple, discrete tasks: Activate; close; connect; disconnect; hold; join; lower; move; press; raise; set
  Complex, continuous tasks: Align; regulate; synchronize; track; transport

Source: From Berliner, Angell, & Shearer (1964).

Rasmussen (1982) developed an information processing failure taxonomy that distinguishes six types of failures: stimulus detection, system diagnosis (a decision about a problem), goal setting, strategy selection, procedure adoption, and action. An analysis of approximately 2000 U.S. Naval aviation accidents using Rasmussen's and other taxonomies concluded that major accidents had a different cognitive basis than minor ones (Wiegmann & Shappell, 1997). Major accidents were associated with judgment errors like decision making, goal setting, or strategy selection, whereas minor accidents were associated more frequently with procedural and response execution errors. This study illustrates how a more detailed analysis may help to resolve issues about the nature of errors, in this case whether the causes of major accidents are fundamentally similar to or different from the causes of minor accidents.
134Intentional classification. We can classify errors as slips or mistakes, according to whether or not someone performed the action that he or she intended. A slip is a failure in execution of action, whereas a mistake arises from errors in planning of action. Reason (1990) related the distinction between slips and mistakes to another taxonomy of behavior modes developed by Rasmussen (1986, 1987). According to this taxonomy, an operator is in a skill-based mode of behavior when performing routine, highly overlearned procedures. When situations arise that are relatively unique, the operator switches to a rule-based mode, where his or her performance is based on recollection of previously learned rules, or a knowledge-based mode, where performance is based on problem solving. Reason attributes slips to the skill-based mode and mistakes to either misapplication of rules or suboptimal problem solving.
135Consider an operator in a nuclear power plant who intended to close pump discharge valves A and E but instead closed valves B and C inadvertently. This is a slip. If the operator used the wrong procedure to depressurize the coolant system, this is a mistake (Reason, 1990). For a slip, the deviation from the intended action often provides the operator with immediate feedback about the error. For example, if you have both mayonnaise and pickle jars on the counter when making a sandwich, and intend to open the mayonnaise jar to spread it on your sandwich, you will notice your error quickly if you slip and open the pickle jar instead. You do not get this kind of feedback when you make a mistake, because your immediate feedback is that the action you performed was executed correctly. It is the intended action that is incorrect, and so the error is more difficult to detect. Consequently, mistakes are more serious than slips. We can also identify a third category of errors, lapses, which involve memory failures such as losing track of your place in an action sequence (Reason, 1990).
136There are three major categories of slips (Norman, 1981): faulty formation of an action plan, faulty activation of an action schema, and faulty triggering of an action schema. An action schema is an organized body of knowledge that can direct the flow of motor activity. We discuss more about action schemas in Chapter 10. For now it is only important to understand that before an action is performed, it must be planned or programmed, and this is what an action schema does. Well- practiced or familiar actions may come from an action schema.
137The faulty formation of an action plan is often caused by ambiguous or misleading situations. Slips resulting from poor action plans can either be mode errors, due to the misidentification of a situation, or description errors for which the action plan is ambiguous or incomplete. Mode errors can occur when instruments, like a digital watch, have several display modes. You can misinterpret a display (e.g., reading the dial as the current time when the watch was displaying stopwatch time), and perform an action (e.g., turning off the oven) that would have been appropriate had the display been in a different mode.
138The second category of slip, faulty activation of action schemas, is responsible for such errors as failing to make an intended stop at the grocery store while driving home from work. The highly overlearned responses that take you home are activated in place of the less common responses that take you to the grocery store. The third kind of slip, faulty triggering of one of several activated schemas, arises when a schema is triggered at the incorrect time or not at all. Common forms of such errors occur in speech. For example, ‘‘spoonerisms’’ are phrases in which words or syllables are interchanged. You might say, ‘‘You have tasted the whole worm’’ instead of ‘‘You have wasted the whole term,’’ for example.
In contrast to slips, we can attribute mistakes to the basic processes involved in planning (Reason, 1987). First, all the information a person needs to act correctly may not be part of the information he or she uses in the planning processes. The information he or she does use, selected according to a number of factors such as attention or experience, will then include only a small amount of potentially relevant information or none at all. Second, the mental operations he or she engages to plan an action are subject to biases, such as paying too much attention to vivid information, a simplified view of how facts are related, and so on. Third, once he or she formulates a plan, or sequence of action schemas, it will be resistant to modification or change; he or she may become overconfident and neglect considering alternative action plans. Various sources of bias can lead to inadequate information on which to base the choice of action, unrealistic goals, inadequate assessment of consequences, and overconfidence in the formulated plan.
We can find an application of the slips/mistakes taxonomy in a study of human error in nursing care (Narumi et al., 1999). Records of reported accidents and incidents in a cardiac ward from August 1996 to January 1998 showed that 75 errors caused patients discomfort, and these were split about evenly between skill-based slips (36) and rule-based mistakes (35), with the remaining four errors being knowledge-based slips. Of 12 life-threatening errors, 11 were rule-based mistakes. The 12th error was a skill-based slip. There were only four errors involving procedural matters, with three due to skill-based slips and one a knowledge-based error. Note that, as for Wiegmann and Shappell's (1997) study of human error in aviation, major errors involved decisions (in this case, predominantly rule-based mistakes) and minor errors tended to be of a more procedural nature (action slips).
143We can distinguish errors (slips, mistakes, and lapses) from violations, which involve disregard for the laws and rules that are to be followed (Reason, 1990; Wiegmann et al., 2005). Routine violations are those that occur on a regular basis, such as exceeding the speed limit when driving on the highway. They may be tolerated or encouraged by organizations or individuals in authority, as would be the case if they adopt a policy of not ticketing a driver for speeding unless the vehicle’s speed is more than 10 mph above the speed limit. As this example suggests, routine violations can be managed to some extent by authorities adopting appropriate policies. Exceptional violations are those that do not occur on a regular basis, such as driving recklessly in an attempt to get to the office of an overnight postal service before it closes for the day. Exceptional violations tend to be less predictable and more difficult to handle than routine violations.
Errors and violations both are unsafe acts performed by operators. Reason (1990) also distinguished three higher levels of human failure: organizational influences, unsafe supervision, and preconditions for unsafe acts. The Human Factors Analysis and Classification System (HFACS; Wiegmann & Shappell, 2003) provides a comprehensive framework for human error, distinguishing 19 categories of causal factors across the different levels (see Figure 3.2). At the highest level are organizational influences, which include the organizational climate and process, and how resources are managed. These may lead to unsafe supervision, including inadequate supervision and violations on the supervisor's part, planning inappropriate operations, and failing to correct problems. Unsafe supervision may result in preconditions for unsafe acts, which can be partitioned into factors involving the physical and technical environments, conditions of operators (adverse mental and physiological states, as well as physical and mental limitations), and personnel factors (crew resource management and personnel readiness). The unsafe acts that may then occur are classified in a manner similar to the errors and violations described previously, but with a slightly different distinction made among the error categories.
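Because HFACS is itself a hierarchy of named categories, it can be represented directly as nested data. The sketch below encodes the levels and the 19 categories listed in the text and in Figure 3.2; the tally function and the sample report are our own illustration, not part of the published HFACS materials.

```python
# HFACS levels and categories, as named in Figure 3.2.
HFACS = {
    "Organizational influences": [
        "Resource management", "Organizational climate", "Organizational process"],
    "Unsafe supervision": [
        "Inadequate supervision", "Planned inappropriate operations",
        "Failure to correct problem", "Supervisory violations"],
    "Preconditions for unsafe acts": [
        "Physical environment", "Technological environment",
        "Adverse mental states", "Adverse physiological states",
        "Physical/mental limitations", "Crew resource management",
        "Personal readiness"],
    "Unsafe acts": [
        "Skill-based errors", "Decision errors", "Perceptual errors",
        "Routine violations", "Exceptional violations"],
}

def tally(causal_factors):
    """Count how many cited causal factors fall at each HFACS level."""
    return {level: sum(1 for f in causal_factors if f in categories)
            for level, categories in HFACS.items()}

# Hypothetical accident report citing one factor at three of the levels.
print(tally(["Inadequate supervision", "Adverse mental states", "Skill-based errors"]))
```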
The strength of HFACS is that it incorporates organizational, psychosocial, aeromedical, and cognitive approaches to human error within a single framework. HFACS provides a valuable tool for analyzing human error associated with general and commercial aviation (Wiegmann et al., 2005), military aviation (Li & Harris, 2005), remotely piloted aircraft (Tvaryanas et al., 2006), and train accidents (Reinach & Viale, 2006).

FIGURE 3.2 The human factors analysis and classification system framework. (The diagram shows four levels: organizational influences, unsafe supervision, preconditions for unsafe acts, and unsafe acts, each broken into the categories listed in the text.)

SUMMARY
The four error taxonomies (action, failure, processing, and intentional classification) capture different aspects of human performance, and each has different uses. The action and failure classifications have been used with success to analyze human reliability in complex systems, but they categorize errors only at a superficial level. That is, errors that are considered to be instances of the same action category may have quite different cognitive bases. The processing and intentional classifications are "deeper" in the sense that they identify underlying causal mechanisms within the human operator, but they require us to make more assumptions about how people process information than do the action and failure classifications. Because the processing and intentional classifications focus on the root causes of the errors, they have the potential to be of greater ultimate use than the classifications based on surface error properties. HFACS, which incorporates these latter classifications within a context of organizational, psychosocial, and aeromedical factors, provides a good framework for comprehensively analyzing human error in complex systems.
RELIABILITY ANALYSIS
When a system performs reliably, it completes its intended function. The discipline of reliability engineering began to develop in the 1950s (Birolini, 1999). The central tenet of reliability engineering is that the total system reliability can be determined from the reliabilities of the individual components and their configuration in the system. Early texts (e.g., Bazovsky, 1961) and comprehensive works (e.g., Barlow & Proschan, 1965) provided quantitative bases for reliability analysis by combining the mathematical tools of probability analysis with the organizational tools of system analysis.
The successful application of reliability analyses to hardware systems led human factors specialists to apply a similar logic to human reliability. In recent years, the discipline of reliability engineering has shown increasing recognition of the importance of including estimates of human performance reliability as part of an overall reliability analysis of a complex system such as a nuclear power plant (Dhillon, 1999; La Sala, 1998). This is because human error is a contributing factor in the majority of serious incidents involving any complex system. In the sections that follow, we describe the basics of reliability analysis in general and then explain human reliability analysis in more detail.
SYSTEM RELIABILITY
Although it would be nice if constructed systems functioned well forever, they do not. The term reliability is used to characterize the dependability of performance for a system, subsystem, or component. We define reliability as "the probability that an item will operate adequately for a specified period of time in its intended application" (Park, 1987, p. 149). For any analysis of reliability to be meaningful, we need to know exactly what system performance constitutes "adequate" operation. The decision about what constitutes adequate operation will depend on what the system is supposed to accomplish.
186There are three categories of failure for hardware systems: operating, standby, and on-demand failures (Dougherty & Fragola, 1988). At the time that this chapter was written, one of the authors was experiencing a building-wide air conditioning system failure. This failure was not an operating failure, because the air conditioning never came on. If the air conditioning had started working and then failed, we would have called the situation an operating failure. The people in charge of maintaining the air conditioning system argued that it was an on-demand failure: although the system was adequately maintained during the winter, they claimed it could not be turned on when the weather became unseasonably warm. The building staff, on the other hand, having experienced intermittent operating failures of the same system during the previous warm season, argued that it was a standby failure: poor maintenance of an already unreliable system resulted in a failure over the winter months when the system was not in operation.
A successful analysis of system reliability requires that we first determine an appropriate taxonomy of component failures. After this determination, we must estimate the reliabilities for each of the system components. Reliability of a component is the probability that it does not fail. Thus, the reliability r is equal to 1 − p, where p is the probability of component failure. When we know or can estimate the reliabilities of individual components, we can derive the overall system reliability by developing a mathematical model of the system using principles of probability. For these methods, we usually rely on empirical estimates of the probability p, or how frequently a particular system component has been observed to fail in the past.
FIGURE 3.3 Examples of (left) serial and (right) parallel systems.
When determining system reliability, a distinction between components arranged in series and in parallel becomes important (Dhillon, 1999). In many systems, components are arranged such that they all must operate appropriately if the system is to perform its function. In such systems, the components are in series (see Figure 3.3). When independent components are arranged in series, the system reliability is the product of the individual probabilities. For example, if two components, each with a reliability of 0.9, must both operate for successful system performance, then the reliability R of the system is 0.9 × 0.9 = 0.81. More generally,

R = r1 × r2 × ··· × rn = ∏(i=1 to n) ri,

where ri is the reliability of the ith component.
Remember two things about the reliability of a series of components. First, adding another component in series always decreases the system reliability unless the added component's reliability is 1.0 (see Figure 3.4). Second, a single component with low reliability will lower the system reliability considerably. For example, if three components in series each have a reliability of 0.95, the system reliability is approximately 0.86. However, if we replace one of these components with a component whose reliability is 0.20, the system reliability drops to 0.18. In a serial system, the reliability can only be as great as that of the least reliable component.

FIGURE 3.4 Reliability of a serial system as a function of number of task components and the reliability of each component. (The plot shows system reliability falling as component reliability decreases from 1.0 to 0.5, with separate curves for one through five components in series.)
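The serial formula is easy to check numerically. A minimal sketch in plain Python that reproduces the examples in the text:

```python
from math import prod

def series_reliability(reliabilities):
    """Reliability of independent components in series: R = r1 * r2 * ... * rn."""
    return prod(reliabilities)

print(round(series_reliability([0.9, 0.9]), 2))          # 0.81
print(round(series_reliability([0.95, 0.95, 0.95]), 2))  # ~0.86
print(round(series_reliability([0.95, 0.95, 0.20]), 2))  # ~0.18
```

The third line shows the point made above: one weak component caps the reliability of the whole series.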
Another way to arrange components is to have two or more perform the same function. Successful performance of the system requires only that one of the components operate appropriately. In other words, the additional components provide redundancy to guard against system failure. When components are arranged in this manner, they are parallel (see Figure 3.3). For a simple parallel system in which all components are equally reliable,

R = 1 − (1 − r)^n,

where r is the reliability of each individual component, and n is the number of components arranged in parallel. In this case, we compute overall system reliability by calculating the probability that at least one component remains functional.
The formula for the reliability of a parallel system can be generalized to situations in which the components do not have equal reliabilities. In this case,

R = 1 − [(1 − r1)(1 − r2) ··· (1 − rn)] = 1 − ∏(i=1 to n) (1 − ri),

where ri is the reliability of the ith component. When i groups of n parallel components with equal reliabilities are arranged in series,

R = [1 − (1 − r)^n]^i.

More generally, the number of components within each group need not be the same, and the reliabilities for each component within a group need not be equal. We find the total system reliability by considering each of n subsystems of parallel components in turn. Let ci be the number of components operating in parallel in the ith group, and let rji be the reliability of the jth component in the ith group (see Figure 3.5). The reliability for the ith subsystem is

Ri = 1 − ∏(j=1 to ci) (1 − rji).

Total system reliability, then, is the reliability of the series of parallel subsystems:

R = ∏(i=1 to n) Ri.

Whereas in serial systems the addition of another component dramatically decreases system reliability, in parallel systems it increases system reliability. It is clear from the expression for Ri that, as the number of parallel components increases, the reliability tends to 1.0. As an illustration, the system reliability for five parallel components each with a reliability of 0.20 is 1.0 − (1.0 − 0.20)^5 = 0.67. When 10 components each with a reliability of 0.20 are arranged in parallel, the system reliability is 1.0 − (1.0 − 0.20)^10 = 0.89. This makes sense if you think of all the components in a parallel system as "backup units." The more backup units you have, the greater the probability that the system will continue to function even if a unit goes bad.
FIGURE 3.5 Computing the reliability of a series of parallel subsystems. (In the worked example, three parallel subsystems with component reliabilities of .95, .88, and .64; .92, .90, .78, and .70; and .80 and .75 have subsystem reliabilities R1 = .998, R2 = .999, and R3 = .950, giving a total system reliability of about .947.)
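Combining the two formulas gives the series-of-parallel calculation shown in Figure 3.5. The sketch below uses the component reliabilities and grouping from that figure and reproduces its total of about .947:

```python
def subsystem_reliability(component_reliabilities):
    """Parallel subsystem: it fails only if every component in the group fails."""
    failure = 1.0
    for r in component_reliabilities:
        failure *= (1.0 - r)
    return 1.0 - failure

def system_reliability(groups):
    """Series of parallel subsystems: multiply the subsystem reliabilities."""
    total = 1.0
    for group in groups:
        total *= subsystem_reliability(group)
    return total

# Component reliabilities from Figure 3.5, grouped into three parallel subsystems.
groups = [[0.95, 0.88, 0.64], [0.92, 0.90, 0.78, 0.70], [0.80, 0.75]]
print(round(system_reliability(groups), 3))  # ~0.947
```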
Some effects on a system, such as the heat caused by a fire, are sudden. Other environmental processes, such as the effect of water on underwater equipment, affect the reliability of the system continuously over time. Consequently, we use two types of reliability measures. For demand- or shock-dependent failures, r = P(S < capacity of the object). That is, reliability is defined as the probability that the level of shock S does not exceed the capacity of the equipment to withstand the shock during the equipment's operation. For time-dependent failures, r(t) = P(T > t), where T is the time of the first failure. In other words, reliability for time-dependent processes is defined as the probability that the first failure occurs after time t. When we have to consider many components simultaneously, as within the context of a large system, time-dependent reliability analysis can be extremely difficult.
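For time-dependent failures, r(t) can be estimated directly from observed failure times. A minimal empirical sketch; the failure times here are invented purely for illustration:

```python
def empirical_reliability(failure_times, t):
    """Estimate r(t) = P(T > t): the proportion of units still working after time t."""
    surviving = sum(1 for T in failure_times if T > t)
    return surviving / len(failure_times)

# Hypothetical first-failure times (hours) for ten identical components.
failure_times = [120, 340, 95, 410, 280, 150, 500, 75, 260, 330]
print(empirical_reliability(failure_times, 200))  # 0.6
```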
We will talk a lot about models in this text. A model is an abstract, simplified, usually mathematical representation of a system. The model has parameters that represent physical (measurable) features of the system, such as operating time or failure probabilities, and the structure of the model determines how predictions about system performance are computed. Later in this chapter, and later in this book, we will talk about models that represent the human information processing system. Such models do not always represent how information is processed very accurately, and sometimes it is very difficult to interpret their parameters. However, the power of these models is in the way they simplify very complex systems and allow us to make predictions about what the system is going to do.
261There is considerable debate about whether the focus of reliability analysis should be on empirically based quantitative models of system architecture, like the serial and parallel models we have described in this section, or on ‘‘physics-of-failure’’ models (Denson, 1998). Physics-of-failure models are concerned with identifying and modeling the physical causes of failure, and advocates of this approach have argued that reliability predictions with it can be more accurate than those derived from empirical estimates of failure probabilities. As we shall see in the next section, the human reliability literature has been marked by a similar debate between models that focus on the reliabilities of observable actions and models that focus on the cognitive processes that underlie these actions.
HUMAN RELIABILITY
We can apply procedures similar to those used to determine the reliability of inanimate systems to the evaluation of human reliability in human–machine systems (Kirwan, 2005). In fact, to perform a probabilistic safety analysis of complex systems such as nuclear power plants, we must provide estimates of human error probabilities as well as machine reliabilities, since the system reliability is to a considerable extent dependent on the operators' performance. Human reliability analysis thus involves quantitative predictions of operator error probability and of successful system performance, although there has been increasing interest in understanding the causes of possible errors as well (e.g., Hollnagel, 1998; Kim, 2001).
Operator error probability is defined as the number of errors made (e) divided by the number of opportunities for such errors (O; e.g., Bubb, 2005):

P(operator error) = e / O.
Human reliability thus is 1 − P(operator error). Just as we can classify hardware failures as time-dependent and time-independent, we can also classify operator errors.
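As a minimal numerical illustration (the counts are invented, not taken from any data bank), an operator who makes 3 errors in 200 opportunities has an error probability of .015 and a reliability of .985:

def operator_error_probability(errors, opportunities):
    """P(operator error) = e / O."""
    return errors / opportunities

def human_reliability(errors, opportunities):
    """Human reliability = 1 - P(operator error)."""
    return 1.0 - operator_error_probability(errors, opportunities)

print(operator_error_probability(3, 200))  # 0.015
print(human_reliability(3, 200))           # 0.985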
269We can carry out a human reliability analysis for both normal and abnormal operating conditions. Any such analysis begins with a task analysis that identifies the tasks performed by humans and their relation to the overall system goals (see Box 3.1). During normal operation, a person might perform the following important activities (Whittingham, 1988): routine control (maintaining a system variable, such as temperature, within an acceptable range of values); preventive and corrective maintenance; calibration and testing of equipment; restoration of service after maintenance; and inspection. In such situations, errors of omission and commission occur as discrete events within the sequence of a person’s activity. These errors may not be noticed or have any consequence until abnormal operating conditions arise. Under abnormal operating conditions, the person recognizes and detects fault conditions, diagnoses problems and makes decisions, and takes actions to recover the system. Although action-oriented errors of omission and commission can still occur during recovery, perceptual and cognitive errors become more likely.
BOX 3.1 Task Analysis
271A first step in human reliability analysis is to perform a task analysis. Such an analysis examines in detail the nature of each component task, physical or cognitive, that a person must perform to attain a system goal, and the interrelations among these component tasks. Task analysis is also a starting point in general for many other human factors concerns, including the design of interfaces and development of training routines. A fundamental idea behind task analysis of any type is that tasks are performed to achieve specific goals. This emphasis on task and system goals is consistent with the importance placed on system goals in systems engineering, and it allows the task analysis to focus on ways to structure the task to achieve those goals.
272As we discussed in Chapter 1, Taylor (1911) and Gilbreth (1909) developed the first task analysis methods. They analyzed physical tasks in terms of motion elements and estimated the time to perform the whole task by adding together the time for each individual motion element. In so doing, Taylor and Gilbreth could redesign tasks to maximize the speed and efficiency with which they could be performed. Taylor and Gilbreth’s approaches focused primarily on physical work and, consequently, were applicable primarily to repetitive physical tasks of the type performed on an assembly line. During the century that has passed since their pioneering efforts, the nature of work has changed and, consequently, many different task analysis methods have been developed to reflect these changes (Diaper & Stanton, 2004; Strybel, 2005).
One of the most widely used task analysis methods is hierarchical task analysis (Annett, 2004; Stanton, 2006b). In hierarchical task analysis, the analyst uses observations and interviews to infer the goals and subgoals for a task, the operations or actions that a person must perform to achieve those goals, and the plans that specify the relations among the component operations. The end result is a diagram specifying the structure of the task. An example of a hierarchical task analysis for a simple task, selecting an item from a pop-up menu in a computer application, is shown in Figure B3.1 (Schweickert et al., 2003). This diagram shows the goal (selecting an item), three elementary operations (search the menu, move the cursor, and double click), and the plan specifying the order of these operations. Of course, the diagrams for most tasks will be considerably more complex than this.
277One of the major changes in jobs and tasks with increasingly sophisticated technology is an increase in cognitive demands and a decrease in physical demands in many work environments. Consideration of cognitive demands is the primary concern for computer interface design, which is the target of much current work on task analysis (Diaper & Stanton, 2004). Consider a Web site, for example. The information that needs to be available at the site may be quite complex and varied in nature, and different visitors to the site may have different goals. Task analyses must evaluate the goals that users have in accessing this information, the strategies they employ in searching for the information, how to structure the information to allow users to be able to achieve their goals, and the best ways to display this information to maximize the efficiency of the search process (Proctor et al., 2003; Strybel, 2005).
The term cognitive task analysis refers to techniques that analyze the cognitive activity of the user or operator, rather than the user's observable physical actions (May & Barnard, 2004; Schraagen et al., 2000). The most widely used analysis method of this type, which was developed explicitly for human–computer interaction (HCI), is the GOMS model and its variants (John, 2003), described in more detail in Chapter 19. GOMS stands for goals, operators, methods, and selection rules. With a GOMS analysis, a task is described in terms of goals and subgoals, and methods are the ways in which the task can be carried out. A method specifies a sequence of mental and physical operators; when more than one method exists for achieving the task goal, a selection rule is used to choose among them. A GOMS model can predict the time to perform a task by estimating the time for each of the individual operations that must be performed in order to accomplish the task goal (see the sketch following Figure B3.1).
FIGURE B3.1 Example of a hierarchical task analysis for selecting an item from a pop-up menu.
0. Select an option from a menu
   Plan 0: Together do 1 and 2–3
   1. Search menu
   2. Move cursor toward menu
   3. Double click on chosen option
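To make the ideas in this box concrete, here is a small Python sketch that represents the menu-selection task of Figure B3.1 as a goal with a plan and component operations, and then produces a rough GOMS-style completion-time estimate by summing per-operation times. The data structure and the operation times are illustrative assumptions of ours, not values from Schweickert et al. (2003) or from any published GOMS database.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Operation:
    name: str
    est_time_s: float  # illustrative time estimate for this operation

@dataclass
class Goal:
    name: str
    plan: str                      # how the operations are sequenced or combined
    operations: List[Operation] = field(default_factory=list)

    def estimated_time(self) -> float:
        """Crude GOMS-style estimate: sum the operation times.
        (Ignores that some operations may overlap, as Plan 0 allows.)"""
        return sum(op.est_time_s for op in self.operations)

select_option = Goal(
    name="0. Select an option from a menu",
    plan="Plan 0: Together do 1 and 2-3",
    operations=[
        Operation("1. Search menu", 1.5),
        Operation("2. Move cursor toward menu", 1.1),
        Operation("3. Double click on chosen option", 0.4),
    ],
)

print(f"{select_option.name}: ~{select_option.estimated_time():.1f} s")  # ~3.0 s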
FIGURE 3.6 Computational and Monte Carlo methods of conducting human reliability analysis. In the computational method, the analyst describes the system, identifies potential errors, estimates error likelihoods and error consequences, combines the error probabilities, and predicts the task/system success probability, determining the relevant parameters and moderating factors along the way. In the Monte Carlo method, the analyst describes the system, compiles and enters the input data, simulates the system and personnel operations, outputs the run data, repeats the model runs, and predicts the task/system success probability, again taking the relevant parameters and moderating factors into account.
Human reliability analyses are based on either Monte Carlo methods that simulate performance on the basis of a system model or computational methods that analyze errors and their probabilities (Boff & Lincoln, 1988). The steps for performing such analyses are shown in Figure 3.6. As in any system/task analysis, the first step for both methods involves a description of the system, that is, its components and their functions. For the Monte Carlo method, the next step is to model the system in terms of task interrelations. At this stage we must make decisions about the random behavior of task times (e.g., are they normally distributed?) and select success probabilities to simulate the operations of the human and the system. We repeat the simulation many times; each time, it either succeeds or fails in accomplishing its task. The reliability of the human or system is the proportion of times that the task is completed in these simulations.
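The Monte Carlo logic can be illustrated in a few lines of Python. In this sketch we assume, purely for illustration, that each subtask's completion time is normally distributed, that subtask successes are independent, and that a run counts as a success only if every subtask succeeds and the total time fits within an allotted window; the subtask names, times, and probabilities are invented. The estimated reliability is the proportion of successful runs.

import random

# Illustrative subtask model: (name, mean time s, sd of time s, P(success))
subtasks = [
    ("detect alarm",    2.0, 0.5, 0.99),
    ("diagnose fault", 20.0, 6.0, 0.95),
    ("execute action",  8.0, 2.0, 0.98),
]
ALLOTTED_TIME = 40.0   # seconds available (assumed)
N_RUNS = 100_000

def one_run(rng):
    total_time = 0.0
    for _, mean_t, sd_t, p_success in subtasks:
        total_time += max(0.0, rng.gauss(mean_t, sd_t))  # sampled execution time
        if rng.random() > p_success:                     # sampled subtask failure
            return False
    return total_time <= ALLOTTED_TIME

rng = random.Random(1)
successes = sum(one_run(rng) for _ in range(N_RUNS))
print(f"Estimated reliability: {successes / N_RUNS:.3f}")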
304For the computational method, after we describe the system, we identify potential errors for each task that must be performed and estimate the likelihood and consequences of each error. We then use these error probabilities to compute the likelihood that the operator accomplishes his or her tasks appropriately and the probability of success for the entire system. Error probabilities can come from many sources, described later; they must be accurate if the computed probabilities for successful performance of the operator and the system are to be meaningful.
305The Monte Carlo and computational methods are similar in many respects, but each has its own strengths and weaknesses. For example, if the computational method is to be accurate, we must perform detailed analyses of the types of errors that can occur, as well as their probabilities and consequences. The Monte Carlo method, in turn, requires us to develop accurate models of the system.
There are many ways to perform a human reliability analysis. One review summarized 35 techniques that had either direct or potential application to the field of healthcare (Lyons et al., 2004). Kirwan (1994) provides a more detailed review of several major techniques available for the quantification of human error probabilities and discusses guidelines for the selection and use of these techniques. There is a difference between first- and second-generation techniques (Hollnagel, 1998; Kim, 2001), although they overlap somewhat. First-generation techniques closely follow those of a traditional reliability analysis but analyze human task activities instead of machine operations. They typically emphasize observable actions, such as errors of commission and omission, and place little emphasis on the cognitive processing underlying the errors. The second-generation techniques are much more cognitive in nature. We provide detailed examples of two first-generation techniques, one that uses the Monte Carlo method (the stochastic modeling technique) and another that uses the computational method (the technique for human error rate prediction, THERP), and two more recent relatives of them (the systematic human error reduction and prediction approach, SHERPA, and task analysis for error identification, TAFEI). We then describe three representative second-generation techniques (the human cognitive reliability model, HCR; a technique for human error analysis, ATHEANA; and the cognitive reliability and error analysis method, CREAM).
309Stochastic modeling technique. An example of the Monte Carlo method of human reliability analysis is the stochastic modeling technique developed by Siegel and Wolf (1969). The technique is intended to determine if an average person can complete all tasks in some allotted time, and to identify the points in the processing sequence at which the system may overload its operators (Park, 1987). It has been applied in complex situations, such as landing an aircraft on a carrier, in which there are many subtasks that a pilot must execute properly. The model uses estimates of the following information:
1. Mean time to perform a particular subtask; the average variability (standard deviation) in performance time for a representative operator
2. Probability that the subtask will be performed successfully
3. Indication of how essential successful performance of the subtask is to completion of the task
4. Subtask that is to be performed next, which may differ as a function of whether or not the initial subtask is performed successfully
316We make three calculations based on these data for each subtask (Park, 1987). First, urgency and stress conditions are calculated according to the subtasks to be performed by the operator in the remaining time. Second, a specific execution time for the subtask is selected by randomly sampling from an appropriate distribution of response times. Finally, whether the subtask was performed correctly is determined by random sampling using the probabilities for successful and unsuccessful performance.
317The stochastic modeling technique is used to predict the efficiency of the operator within the entire system based on the simulated performance of each subtask. This technique has been applied with reasonable success to a variety of systems. Moreover, it has been incorporated into measures of total system performance.
Technique for human error rate prediction. THERP, developed in the early 1960s, is one of the oldest and most widely used computational methods for human reliability analysis (Swain & Guttmann, 1983). It was designed initially to determine human reliability in the assembly of bombs at a military facility, and it subsequently has been the basis of reliability analyses for industry and nuclear facilities (Bubb, 2005).
319The reliability analyst using THERP proceeds through a series of steps (Miller & Swain, 1987):
1. Determine the system failures that could arise from human errors.
2. Identify and analyze the tasks performed by the personnel in relation to the system functions of interest.
3. Estimate the relevant human error probabilities.
4. Integrate the human reliability analysis with a system reliability analysis to determine the effects of human errors on the system performance.
5. Recommend changes to the system to increase the reliability, and then evaluate these changes.
327The most important steps in THERP are the third and fourth steps. These involve determining the probability that an operation will result in an error and the probability that a human error will lead to system failure. Such probabilities can be estimated from a THERP database (Swain & Guttmann, 1983) or from any other data, such as simulator data, that may be relevant.
328Figure 3.7 depicts these probabilities in an event-tree diagram. In this figure, a is the probability of successful performance of task 1, and A is the probability of unsuccessful performance. Similarly, b and B are the probabilities for successful and unsuccessful performance of task 2. The first branch of the tree thus distinguishes the probability of performing or not performing task 1. The second level of branches involves the probabilities of performing or not performing task 2 successfully, depending on the performance of task 1. If the two tasks are independent (see Chapter 2), then the probability of completing task 2 is b and of not completing it is B. If we know the probability values for the individual component tasks, we can compute the probability of any particular combination of performance or nonperformance of the tasks, as well as the overall likelihood for total system failure resulting from human error.
329As an example, suppose that we need to perform a THERP analysis for a worker’s tasks at one station on an assembly line for portable radios. The final assembly of the radio requires that the electronic components be placed in a plastic case. To do this successfully, the worker must bend a wire for the volume control to the underside of the circuit board and snap the two halves of the case together. If the worker fails to wrap the wire around the board, the wire may be damaged when he or she closes the case. The worker might also crack the case during assembly. The probability that the worker positions the wire correctly is .85, and the probability that the worker does not crack the case is .90. Figure 3.8 illustrates the event tree for these tasks. The probability that the radio is assembled correctly is .765. The benefit of the THERP analysis in this example is that weaknesses in the procedure, such as the relatively high probability of poorly placing the wire, can be identified and eliminated to increase the final probability of correct assembly.
FIGURE 3.7 Task/event-tree diagram. The first branch splits on task 1, performed successfully with probability a or unsuccessfully with probability A; the second level splits on task 2, with conditional probabilities b|a and B|a following success on task 1 and b|A and B|A following failure on task 1.

FIGURE 3.8 Event-tree diagram for the assembly of portable radios. Task 1 is proper positioning of the wire (success .85, failure .15); task 2 is proper assembly of the plastic case (success .90, failure .10). The four branches have probabilities (.85)(.90) = .765 (wire positioned and case intact: the radio is assembled correctly), (.85)(.10) = .085 (wire positioned, case cracked), (.15)(.90) = .135 (wire mispositioned, case intact), and (.15)(.10) = .015 (wire mispositioned, case cracked).

Though THERP compares favorably to other human reliability assessment techniques for quantifying errors (Kirwan, 1988), the THERP error categorization procedure relies on the action classification described earlier, that is, on errors of omission and commission. This focus is problematic (Hollnagel, 2000). Because THERP relies on an event tree (Figure 3.8), we see each step in a sequence of actions as either a success or a failure. Categorizing errors in this way is independent of the human information processes that produce the specific errors. More recent techniques, such as the HCR model discussed later, place more emphasis on the processing basis of errors.
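The event-tree arithmetic from the radio-assembly example can be written out directly. The sketch below is our own code; it uses only the probabilities given in the text, assumes the two tasks are independent, and enumerates the four branches of Figure 3.8 before reporting the probability of a correctly assembled radio.

from itertools import product

P_WIRE_OK = 0.85   # task 1: wire positioned correctly
P_CASE_OK = 0.90   # task 2: case assembled without cracking

branches = {}
for wire_ok, case_ok in product([True, False], repeat=2):
    p = (P_WIRE_OK if wire_ok else 1 - P_WIRE_OK) * \
        (P_CASE_OK if case_ok else 1 - P_CASE_OK)
    branches[(wire_ok, case_ok)] = p

for (wire_ok, case_ok), p in branches.items():
    print(f"wire ok={wire_ok!s:5} case ok={case_ok!s:5} p={p:.3f}")

# The radio is assembled correctly only when both tasks succeed.
print(f"P(correct assembly) = {branches[(True, True)]:.3f}")      # 0.765
print(f"P(any failure)      = {1 - branches[(True, True)]:.3f}")  # 0.235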
Systematic human error reduction and prediction approach and task analysis for error identification. SHERPA (Embrey, 1986; Stanton, 2005) and TAFEI (Stanton & Baber, 2005) are related methods that can be used easily to predict human errors when a person is interacting with a device. The first step for both is a hierarchical task analysis (see Box 3.1) that decomposes work activities into a hierarchy of goals, operations to achieve the goals, and plans for executing these operations in an appropriate sequence. The resulting task hierarchy provides the basis for determining possible errors and their relative likelihood.
To use SHERPA, the reliability analyst takes each operation at the lowest level of the task hierarchy and classifies it as one of five types: action, retrieval, checking, selection, and information communication. For each operation, the analyst must identify several possible error modes. For example, an action error may be one of mistiming the action, or a checking error may be one of omitting the check operation. The analyst then considers the consequences of each error, and for each, whether the operator could take any recovery action. The analyst will assign a low ‘‘probability’’ if the error is unlikely to ever occur, medium if it occurs on occasion, and high if it occurs frequently. The analyst also designates each error as critical (if it would lead to physical damage or personal injury) or not critical. In the last step, the analyst provides strategies for error reduction. The structured procedure and error taxonomy make SHERPA relatively easy to perform, but the analysis does not consider the cognitive bases of errors.
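A SHERPA analysis is essentially a structured table keyed by operation. The sketch below shows one way such a record might be kept in code; the operation, error mode, consequence, and remedy are invented examples, and the low/medium/high and critical ratings simply follow the scheme just described rather than any published SHERPA data set.

from dataclasses import dataclass

OPERATION_TYPES = {"action", "retrieval", "checking", "selection", "information communication"}

@dataclass
class SherpaEntry:
    operation: str        # lowest-level operation from the task hierarchy
    operation_type: str   # one of OPERATION_TYPES
    error_mode: str       # e.g., action mistimed, check omitted
    consequence: str
    recovery: str         # how (or whether) the error can be recovered
    probability: str      # "low", "medium", or "high"
    critical: bool        # physical damage or personal injury possible?
    remedy: str           # proposed error-reduction strategy

entry = SherpaEntry(
    operation="Check coolant valve position",
    operation_type="checking",
    error_mode="check omitted",
    consequence="valve left closed; pump runs dry",
    recovery="noticed at next hourly inspection",
    probability="medium",
    critical=True,
    remedy="add an interlock and a checklist item before pump start",
)
assert entry.operation_type in OPERATION_TYPES
print(entry)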
To use TAFEI, after first performing the hierarchical task analysis, the analyst constructs state-space diagrams that represent a sequence of states through which the device can pass until it reaches its goal. For each state in the sequence, the analyst indicates links to other system states to represent the possible actions that can be taken to move the system from the present state to another state. He or she then enters this information into a transition matrix that shows the possible transitions from different current states to other states. The matrix records legal transitions as well as illegal, error transitions. This procedure results in design solutions that make it impossible for a user to make illegal transitions. TAFEI and SHERPA, when used in combination, will allow the analyst to make very accurate reliability predictions.
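A TAFEI transition matrix can be represented as a simple table of state pairs marked legal or illegal. The device, its states, and the transitions below are invented for illustration; the point is only to show how illegal transitions fall out of the matrix so that the design can be changed to block them.

# Hypothetical device states for a TAFEI-style analysis of an electric kettle.
states = ["off", "filled", "switched on", "boiling", "pouring"]

# Legal transitions the design intends (from_state -> to_state).
legal = {
    ("off", "filled"),
    ("filled", "switched on"),
    ("switched on", "boiling"),
    ("boiling", "pouring"),
    ("pouring", "off"),
}

# Build the transition matrix: every ordered pair of distinct states is either legal or illegal.
matrix = {
    (frm, to): ("legal" if (frm, to) in legal else "illegal")
    for frm in states for to in states if frm != to
}

# Illegal transitions are candidate error opportunities to design out.
illegal = [pair for pair, status in matrix.items() if status == "illegal"]
print(f"{len(illegal)} illegal transitions, e.g.:", illegal[:3])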
358Human cognitive reliability model. First-generation models such as the stochastic modeling technique and THERP are primarily concerned with predicting whether humans will succeed or fail at performing various tasks and subtasks. Second-generation models are more concerned with what the operator will do. The HCR model, developed by Hannaman, Spurgin, and Lukic (1985), is one of the earliest second-generation models because of its emphasis on human cognitive processes. The approach was developed to model the performance of an industrial plant crew during an accident sequence. Because the time to respond with appropriate control actions is limited in such situations, the model provides a way to estimate the probability of time-dependent operator failures (nonresponses). The input parameters to the model are of three types: category of cognitive behavior, median response time, and environmental factors that shape performance.
As with all the other techniques, the human reliability analyst first identifies the tasks the crew must perform. Then, he or she must determine the category of cognitive process required for each task. HCR uses the categories from Rasmussen's (1986, 1987) taxonomy described earlier: skill-based, rule-based, and knowledge-based behaviors. Recall that skill-based behavior represents the performance of routine, overlearned activities, whereas rule-based and knowledge-based behaviors are not so automatic. Rule-based behavior is guided by a rule or procedure that has been learned in training, and knowledge-based behavior occurs when the situation is unfamiliar (see earlier discussion in this chapter). HCR is based on the idea that the median time to perform a task will increase as the cognitive process changes from skill-based to rule-based to knowledge-based behavior.
The analyst estimates the median response times for a crew to perform its required tasks from a human-performance data source, some of which are described in the next section. He or she then modifies these times by incorporating performance-shaping environmental factors such as level of stress, arrangement of equipment, and so on. The analyst must also evaluate response times against the time available to perform the task, which provides a basis for deciding whether the crew will complete the required tasks in the available time.
361The most important part of the HCR model is a set of normalized time–reliability curves, one for each mode of cognitive processing (see Figure 3.9). These curves estimate the probability of a nonresponse at any point in time. The normalized time TN is
TN = TA / TM,

where TA is the actual time to perform the task and TM is the median time to perform the task.
365The analyst uses these normalized curves to generate nonresponse probabilities at various times after an emergency in the system develops.
FIGURE 3.9 HCR model crew nonresponse curves for skill-, rule-, and knowledge-based processing. The curves plot nonresponse probability (on a logarithmic scale) against normalized time, with separate curves for knowledge-, rule-, and skill-based behavior.

The HCR model was developed and evaluated within the context of nuclear power plant operation and focuses mainly on the temporal aspects of crew performance. Many of its fundamental hypotheses have been at least partially verified (Worledge et al., 1988), leading Whittingham (1988) to propose that a combination of the HCR and THERP models should provide a good predictor of human reliability. An application of these two models can be found in a report that quantified improvements in human reliability for a nuclear power plant (Ko et al., 2006). The plant had implemented a severe accident management guidance program, which provided operators with structured guidance for responding to an emergency condition. The analysis was conducted to show that the implementation of the structured guidance program changed the operators' behavior mode from knowledge based (i.e., problem solving) to rule based (following the rules of the program). This made it more likely that operators would complete the necessary tasks within the time limits.
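The nonresponse curves in Figure 3.9 are usually described as Weibull-type functions of normalized time. The Python sketch below uses that general shape with made-up shape and scale parameters (they are not Hannaman et al.'s published coefficients) simply to show how an analyst would read off a nonresponse probability for each processing mode at a given multiple of the median response time.

import math

def nonresponse_probability(t_actual, t_median, shape, scale):
    """Probability that the crew has NOT yet responded by time t_actual.

    Uses a Weibull-type survival function of normalized time, the general
    shape usually attributed to the HCR curves. The shape/scale values
    passed in are illustrative only."""
    t_norm = t_actual / t_median          # normalized time TN = TA / TM
    return math.exp(-((t_norm / scale) ** shape))

# Made-up parameters: steeper (more predictable) for skill-based behavior,
# shallower for knowledge-based behavior.
modes = {"skill": (3.0, 1.2), "rule": (2.0, 1.4), "knowledge": (1.5, 1.7)}

for mode, (shape, scale) in modes.items():
    p = nonresponse_probability(t_actual=30.0, t_median=10.0, shape=shape, scale=scale)
    print(f"{mode:9s}: P(nonresponse at 3x median time) = {p:.3g}")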
Technique for human error analysis. Another model representative of a second-generation technique is ATHEANA (USNRC, 2000). As in a typical probabilistic reliability analysis, ATHEANA begins by identifying possible human failure events from accident scenarios. The analyst describes these events by enumerating the unsafe actions (errors of omission or commission) of the operators, and then characterizing them further using Reason's (1990) distinctions between slips, lapses, mistakes, and violations of regulations. The model combines environmental factors and plant conditions affecting the likelihood of human errors into error-forcing contexts, that is, situations in which an error is likely. The descriptions of these error-forcing contexts may lead to better identification of possible human errors and where they are most likely to occur in a task sequence. The final result of an ATHEANA analysis is a quantitative estimate of the conditional probability of an unsafe action as a function of the error-forcing context in the situation under study.
ATHEANA is very detailed and explicit. Most importantly, after an accident, the reliability expert can identify particular errors of commission resulting from an error-forcing context. However, it has several limitations (Dougherty, 1997; Kim, 2001). Because it is a variant of probabilistic reliability analysis, it suffers from the many shortcomings associated with that approach. As one example, ATHEANA continues to make a distinction between errors of commission and omission, which, as we noted earlier, is linked to probabilistic reliability analysis and is independent of the cognitive basis for the errors. Another shortcoming concerns the model's emphasis on an error-forcing context, which might imply that a particular situation may allow no chance for success. Because this context is used as a substitute for the many factors that influence human cognition and performance in a task, it may be more profitable to develop more detailed models of cognitive reliability, as in the next method we consider.
Cognitive reliability and error analysis method. CREAM (Hollnagel, 1998) takes a cognitive engineering perspective, according to which the human–machine system is conceptualized as a joint cognitive system, and human behavior is shaped by the context of the organization and technological environment in which it resides. After a task analysis, CREAM requires an assessment of the conditions under which the task is commonly performed. Some of these conditions might include the availability of procedures and plans to the operator, the available time for the task, when the task is performed, and the quality of collaboration among members of the crew. Given the context in which a task is performed, the reliability analyst then develops a profile to identify the cognitive demands of the task. The analyst describes these demands using the cognitive functions of observation, interpretation, planning, and execution. Then, for each task component, the analyst assesses what kinds of strategies or control modes are used by the operators to complete the task.
380 CREAM considers four possible control modes: strategic, tactical, opportunistic, or scrambled. For the strategic mode, a person’s action choices are guided by strategies derived from the global context; for the tactical mode, his or her performance is based on a procedure or rule; for the opportunistic mode, salient features of the context determine the next action; for the scrambled mode, the choice of the next action is unpredictable. The reliability analysis is completed when the reliability expert identifies what cognitive function failures are most likely to occur and computes the cognitive failure probabilities for the task elements and for the task as a whole.
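CREAM links the assessed performance conditions to a control mode and, through the control mode, to a band of failure probabilities. The Python sketch below illustrates only that flow; the threshold rule and the probability bands are simplified placeholders of our own, not Hollnagel's published mapping or values.

def control_mode(n_improving, n_reducing):
    """Pick a control mode from counts of performance conditions judged to
    improve or reduce reliability. This simple threshold rule is a stand-in
    for CREAM's actual context-to-mode mapping."""
    balance = n_improving - n_reducing
    if balance >= 4:
        return "strategic"
    if balance >= 1:
        return "tactical"
    if balance >= -3:
        return "opportunistic"
    return "scrambled"

# Illustrative (not Hollnagel's) failure-probability bands per control mode.
FAILURE_BANDS = {
    "strategic":     (1e-5, 1e-2),
    "tactical":      (1e-3, 1e-1),
    "opportunistic": (1e-2, 0.5),
    "scrambled":     (1e-1, 1.0),
}

mode = control_mode(n_improving=2, n_reducing=4)
print(mode, FAILURE_BANDS[mode])   # opportunistic, with its assumed band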
381CREAM is a detailed method for quantifying human error in terms of the operator’s cognitive processes. CREAM’s method is more systematic and clear than that of ATHEANA, and it allows the analyst to perform both predictive and retrospective analyses using the same principles (Kim, 2001). One of its limitations is that it does not explicitly take into consideration how people might recover from erroneous actions: All errors are assumed to be nonrecoverable. This means that CREAM will tend to underestimate human reliability in many situations.
Human performance data sources. Human reliability analysis requires that we explicitly specify estimates of human performance for various tasks and subtasks. Such estimates include the probability of correct performance, reaction time, and so on. Figure 3.10 shows several possible sources for useful performance estimates. The best estimates come from empirical data directly relevant to the task to be analyzed. Such data may come from laboratory studies, research conducted on trainers and simulators, or actual system operation. Data like these are summarized in data banks (such as the Human Reliability Data Bank for Nuclear Power Plant Operators, Topmiller et al., 1982, and the Engineering Data Compendium: Human Perception and Performance, Boff & Lincoln, 1988) and handbooks (such as the Handbook of Human Factors and Ergonomics, Salvendy, 2006), with more detailed descriptions presented in the original research reports. The primary limitation of these data sources is that the most commonly used data come from laboratory studies typically conducted under restricted, artificial conditions; generalization to more complex systems thus should be made with caution. Moreover, the amount of data available in any data bank is limited.
FIGURE 3.10 Human performance data sources and outputs. Data sources comprise empirical data (laboratory studies, trainers/simulators, and operational data from exercises/war games and routine operations) and subjective judgment (psychometrically derived or informal); the resulting outputs are raw data and formatted data (data banks).
Simulators provide another source of data for complex systems, such as chemical waste disposal plants, for which a failure can be hazardous (Collier et al., 2004). The simulator can create specific accident sequences to analyze the performance of the personnel in such circumstances without endangering the system or its operators. It permits the analyst to measure response accuracy and latency to critical events, and it also provides the opportunity to use interviews to obtain information from the operators about the displays and indicators to which they attended and how they made decisions (Dougherty & Fragola, 1988, p. 50).
398Another way to estimate human error probability parameters is from computer simulations or mathematical models of human performance (Yoshikawa & Wu, 1999). An accurate model can provide objective probability estimates for situations for which direct empirical data are not available. A final option is to ask experts and obtain their opinions about the probabilities of specific errors. However, information obtained in this way is highly subjective, and so you should interpret it cautiously.
PROBABILISTIC RISK ANALYSIS
In complex systems, the risks associated with various system failures are assessed as part of a reliability analysis. Risk refers to events that might cause harm, such as a nuclear power plant releasing radioactive steam into the atmosphere. A risk analysis, therefore, considers not only the reliability of the system, but also the risks that accompany specific failures, such as monetary loss and loss of life. Probabilistic risk analysis, the methods of which were developed and applied primarily within the nuclear power industry, involves decomposing the risk of concern into smaller elements for which the probabilities of failure can be quantified (Bedford & Cooke, 2001). These probabilities then are used to estimate the overall risk, with the goals of establishing that the system is safe and of identifying the weakest links (Paté-Cornell, 2002).
The human risk analysis of a complex system like a nuclear plant includes the following goals:
1. Represent the plant's risk contribution from its people and their supporting materials, such as procedures
2. Provide a basis upon which plant managers may make modifications to the plant while optimizing risk reduction and enhancing human factors
3. Assist the training of plant operators and maintenance personnel, particularly in contingencies, emergency response, and risk prevention (Dougherty & Fragola, 1988, p. 74)
The nuclear power industry uses probabilistic risk analysis methods to identify plant vulnerabilities, justify additional safety requirements, assist in designing maintenance routines, and support the decision-making process during routine and emergency procedures (Zamanali, 1998).
406The focus of a reliability analysis is on successful operation of the system, and so we look at the system environment in terms of its influence on system performance. In contrast, the focus of a risk analysis is to evaluate the influence of system failures on the environment. Maximization of system reliability and minimization of system risk require that we conduct risk and reliability analyses and address design concerns at all phases of system development and implementation.
SUMMARY
The operator is part of a human–machine system. Consequently, the system concept plays a central role in human factors. We must examine the contribution of the operator from within the context of the system. The performance of a system depends on many variables, some unique to the mechanical aspects of the system and some unique to the human aspects of the system. We can find still more variables in the system environment.
Errors by a system's operator can result in system failure. A fundamental goal of human factors is to minimize risk while maximizing system reliability. This requires that the human factors expert perform an analysis of the sources of potential human errors and an evaluation of their consequences for overall system performance. For this purpose the expert can use several alternative classifications for types of errors.
411We estimate system reliability from the reliabilities of the system’s components and the structure of the system. Reliability analysis can successfully predict the reliability of machines. Human reliability analysis is based on the assumption that the performance of the operator can be analyzed using similar methods. Human and machine reliability analyses can be combined to predict the overall performance of the human–machine system and the overall risk associated with its operation.
A theme we repeat frequently in this book is that optimal system design requires us to consider human factors at every stage of the system development or design process. This means we must consider the potential for different types of human errors at every stage of the system development process. By incorporating known behavioral principles into system design and evaluating design alternatives, the human factors specialist ensures that the system can be operated safely and efficiently.
RECOMMENDED READINGS
Birolini, A. 1999. Reliability Engineering: Theory and Practice. New York: Springer.
Gertman, D.I. & Blackman, H.S. 1994. Human Reliability & Safety Analysis Handbook. New York: Wiley.
Hollnagel, E. 1998. Cognitive Reliability and Error Analysis Method. London: Elsevier.
Kirwan, B. 1994. A Guide to Practical Human Reliability Assessment. London: Taylor & Francis.
Reason, J. 1990. Human Error. Cambridge: Cambridge University Press.
Senders, J.W. & Moray, N.P. 1991. Human Error: Cause, Prediction, and Reduction. Hillsdale, NJ: Lawrence Erlbaum.
Westerman, H.R. 2001. System Engineering Principles and Practice. Boston, MA: Artech House.