Definition of a real-time system
The Oxford Dictionary of Computing's definition of a real-time system: any system in which the time at which output is produced is significant. This is usually because the input corresponds to some movement in the physical world, and the output has to relate to that same movement. The lag from input time to output time must be sufficiently small for acceptable timeliness. 'Timeliness' is system-dependent. For instance, a missile guidance system requires output within milliseconds, whereas a computer-controlled water dam may only need output within a second.
The correctness of a real-time system depends not only on the logical result of the computation, but also on the time at which the results are produced. A hard real-time system is one where it is absolutely imperative that responses occur within the specified deadline. In a soft real-time system, the system will still function correctly if deadlines are occasionally missed. A system may have both soft and hard real-time subsystems, and both soft and hard deadlines. A service may even have both a soft and a hard deadline, with different reactions to the two deadlines.
A deadline that can be missed occasionally, but for which there is no benefit from late delivery, is called firm. Another means of classifying the role that time has in a real-time system is to distinguish between reactive systems and time-aware systems. Time-aware systems make explicit references to the time frame of the enclosing environment. A reactive system is more concerned with relative times; however, since reactive systems are often also control systems, they need to be synchronized with their environment. In order for a reactive system to keep up with its environment, it is often structured to be time-triggered: all computations are periodic in that they have a defined cycle time, and are released for execution by an internal clock. The alternative is event-triggered, where the environment explicitly controls the release for execution of some activity. These activities are termed aperiodic, or sporadic if the number of releasing events within a time interval is bounded.
Examples of real-time systems
A system where a computer interacts with the environment through sensors (for instance temperature or pressure transducers) and actuators (valves, motors and so on) is an example of a real-time system, e.g. process plants, elevators, washing machines, etc. Real-time systems are used in process control, manufacturing, communication, multimedia, command and control, and pretty much everywhere else. These examples are thoroughly explained in the book for those interested.
Characteristics of real-time systems
Following are characteristics of real-time systems, which will be thoroughly examined in the later chapters of the book. A real-time system need not, of course, exhibit all these characteristics, but should have facilities which support them.
Real-Time facilities
As mentioned, response time is crucial in any embedded (/real-time) system. As this is often near impossible to guarantee under all possible conditions with the available computing resources, real-time systems are often constructed using processors with considerable spare capacity. This is to ensure that a worst-case scenario does not produce unwelcome delays. Given adequate processing power, language and run-time support are required to enable the programmer to:
 • specify times at which actions are to be performed;
 • specify times at which actions are to be completed;
 • support repeating (periodic or aperiodic) work;
 • control (i.e. bound) the jitter on input and output operations;
 • respond to situations where not all of the timing requirements can be met;
 • respond to situations where the timing requirements are changed dynamically.
These are called real-time control facilities, and enable the program to synchronize with time itself.
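As an illustration of time-triggered, periodic work, the sketch below uses POSIX clock_nanosleep with an absolute release time; control_action and the 10 ms cycle time are hypothetical placeholders, not something prescribed by the book.
#include <time.h>

#define PERIOD_NS 10000000L            /* assumed 10 ms cycle time */

void control_action(void);             /* hypothetical periodic work */

void periodic_task(void) {
    struct timespec next;
    clock_gettime(CLOCK_MONOTONIC, &next);
    for (;;) {
        control_action();
        next.tv_nsec += PERIOD_NS;     /* compute the next absolute release time */
        if (next.tv_nsec >= 1000000000L) {
            next.tv_nsec -= 1000000000L;
            next.tv_sec += 1;
        }
        /* sleeping until an absolute time avoids cumulative drift */
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
    }
}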
Some systems are more demanding at certain periods, and therefore need to change the timing requirements dynamically. As an example, an aircraft flight control system needs to focus on the task at hand, and also be ready to give all computing resources to handling an emergency. Such changes are generally known as mode changes, and they also have consequences for the temporal characteristics of the executing software.
Concurrent control of separate system components
Often, an embedded system must handle several coexisting external elements with which the computer must interact simultaneously. It may then also be necessary to consider distributed and multiprocessor embedded systems. The problem then becomes how to express this concurrency in the structure of the program. Chapters 4, 5 and 6 will consider various models of concurrent programming, and how to achieve reliable communication and synchronization between concurrent processes in the presence of design errors.
Low-level programming
An embedded system typically requires the computer components to interact with the external world by monitoring sensors and controlling actuators for a wide variety of devices. These devices interface to the computer via input and output registers, their operational requirements are device- and computer-dependent, and they may also generate interrupts. A real-time system is also often run on battery in a mobile device, and we have already noted that real-time systems depend heavily on response time. Low-level programming features let the programmer control such delay and power consumption directly from the language. High-level programming enables the programmer to abstract away from implementation details and to concentrate on solving the problem at hand. For a real-time programmer, this luxury cannot always be afforded, since delay and power consumption are two central concerns when programming a real-time system.
Support for numerical computation
When a real-time system controls some engineering activity, it often involves continuous signals. For very simple systems, one might have an analog controller working directly on the continuous signals. One wishes to calculate a control input based on comparing the output with a reference signal. This will often require heavy calculations, depending on the complexity of the model and the number of inputs and outputs. Because of these difficulties, most controllers are implemented as digital computers. Since nature is analog, and computers are digital, converters are needed in order to convert analog signals to digital, and vice versa. This must again be taken into account when designing the control algorithms. A fundamental requirement of a real-time programming language, therefore, is the ability to manipulate real or floating-point numbers. Fortunately most engineering languages do provide the necessary abstractions in this area.
Large and complex
Largeness is, by Lehman and Belady (1985), related to variety. Since most real-time systems respond to real-world events, the amount of variety is great. Hence, most real-time systems will exhibit the rather undesirable property of largeness. Because of this largeness, real-time systems need to be broken down into smaller components which can be managed effectively. Real-time languages provide features such as abstract data types, classes, objects, generic components and separate compilation to manage this software complexity.
Extremely reliable and safe
For some systems, failure is simply not an option. Systems providing life support, controlling nuclear plants or transferring money cannot fail; and if they do fail, they must fail in a controlled way. When human interaction is required, care must be taken in the design of the interface in order to minimize the possibility of human error. Chapters 2 and 3 will consider the problems of producing reliable and safe software. Coping with both expected and unexpected error conditions will be examined in chapters 7 and 13(?).
Structure of the book
If you are interested in the structure of the book, read the book, not the compendium.
Development cycle for real-time systems
The most important stage in the development of any real-time system is the generation of a consistent design that satisfies an authoritative specification of requirements. The book does not focus on issues of design, but rather on the investigation of language and operating system primitives which allow designs to be realized. This section gives a brief overview of some of the typical stages that are passed through in the top-down design approach:
 • requirements specification - during which an authoritative specification of the system's required functional and non-functional behaviour is produced;
 • architectural design - during which a top-level description of the proposed system is developed;
 • detailed design - during which the complete system design is specified;
 • coding - during which the system is implemented;
 • testing - during which the efficacy of the system is tested.
Requirements specification
Almost all computing projects start with an informal description of what is desired, followed by an extensive analysis of requirements. In this stage, functionality is defined, temporal behaviour is made explicit, and reliability requirements and the desired behaviour of the software in the event of component failure are specified. The environment must also be modelled. Since interaction with the environment is especially important for a real-time system, the maximum rate of interrupts, the maximum number of objects and the failure modes are all important.
Design activities
Decomposition and abstraction form the basis of most software engineering methods. Decomposition involves the systematic breakdown of the complex system into smaller and smaller parts, until components are isolated that can be understood and engineered by individuals or small groups. Abstraction enables the consideration of detail, particularly that appertaining to implementation, to be postponed. Sub-levels in the design must have well-defined roles in order to be verifiable. If the specification of the entire system can be verified just in terms of the specifications of the immediate subcomponents, the decomposition is said to be compositional. This is an important property when formally analysing programs. Whenever a class (object) is used, it typically involves the addition of some form of task (process). Both object and task abstractions are important in the design and implementation of reliable embedded systems. These forms of encapsulation lead to the use of modules, from which more sizable components can be defined that may even be re-usable in subsequent designs. Cohesion and coupling are two metrics that are often used to describe the relationships between entities within a design. Cohesion is concerned with how well a module holds together - its internal strength. Allworth and Zobel give six measures of cohesion, ranging from the very poor (coincidental) to the strong (functional). Coupling, by comparison, is a measure of the interdependence of program modules. Two modules passing control information between them are said to be tightly coupled, or to have high coupling. At the other end, you have loose coupling if only data is communicated. Loose coupling means that a module is easy to replace with another one, which is what we want. A good decomposition is one with strong cohesion and loose coupling. This principle is equally true in sequential and concurrent programming domains.
Testing and simulation
To assist in any complex testing activity, a realistic test bed presents many attractions. For software, such a test environment is called a simulator. Using a simulator helps create normal as well as abnormal system behaviour without having to melt down a nuclear reactor in order to test the system. It is crucial that the simulator is able to reproduce accurately the sequence of events expected in the real system. As mentioned before, high reliability requirements are the essence of most real-time systems, so it is clear that testing must be extremely stringent.
Languages for programming real-time systems
There are three classes of programming languages to be identified: assembly languages, sequential systems implementation languages and high-level concurrent languages. These will be reviewed after introducing some general language design criteria.
General language design criteria
Young (1982) lists the following six criteria as the basis of a real-time language design: security, readability, flexibility, simplicity, portability and efficiency.
Assembly languages
Initially, most real-time systems were written in assembly language, mostly due to the lack of support in high-level languages on microcomputers. Assembly language achieves efficient implementations, but is machine-oriented rather than problem-oriented. This keeps development costs high and makes maintenance difficult and expensive.
Sequential systems implementation languages
The most common programming languages used for real-time programming today are sequential. They also tend to be weak in the facilities they provide for real-time control and reliability. As a result of these shortcomings, it is often necessary to rely on operating system support and assembly code inserts.
High-level concurrent programming languages
The software crisis in the 1970s had several symptoms. In short, the general language design criteria discussed above were seldom met. This was partly because there were too many general-purpose languages in use. In the American Department of Defense (DoD), over 450 different languages were used in embedded computer applications. The DoD concluded that a single language was a desirable goal, and that no current language was suitable, so they decided to develop a new language, which resulted in Ada in 1983. Ada was later updated in 1995, 2005 and 2012, which is the current version.
Reliability and fault tolerance
Real-time and embedded systems usually have much higher requirements for reliability and safety. It is pretty annoying when Spotify crashes or Matlab hangs, but these problems have little to no impact on the surroundings. A control system for, say, avionics has much higher demands on reliability.
Faults and failure
Let us first clear up some definitions: a fault is a defect in the system; when activated, it gives rise to an error, an erroneous internal state; and a failure occurs when the system's delivered service deviates from its specified behaviour.
Note that this implies a system maintains its reliability until a failure, and a failure only happens when the system does not interact with the environment in the prescribed manner. This means a reliable system either prevents activation of faults, or prevents the error from propagating through the system. This chain is shown in figure 1. Note that a failure in a subsystem will lead to a fault in the enclosing system. There are four sources of faults that can result in a system failure:
 • Inadequate specification
 • Software faults
 • Hardware faults
 • Interference in the communication subsystems
The last two are in some ways predictable errors, in that the system designer will know what happens. For example, if a network module fails, a message will not be delivered. However, when an error arises from incorrect interaction or wrong code, the result can be more unpredictable, as the scope of the effects cannot be foreseen as clearly as for a hardware error. In this regard, a fault is the source of misbehaviour. Regardless of the above sources, there are three types of faults that can exist in a system (transient, intermittent and permanent). Also note that most system faults originate from incorrect or incomplete specification.
It is worth noting that software faults have a special name: the notorious bug. Bugs can be divided into two categories, with the names taken from their quantum physics counterparts. Bohrbugs are easily reproduced and easy to find and thus remedy. Heisenbugs are more complex bugs that arise from certain, often complex, combinations of events. A good example of this is threads that are not properly synchronized, which produce seemingly random behaviour depending on the state of each thread.
Failure modes and fault prevention
There are, generally speaking, two ways a system can fail. When a system produces a value failure, the output is wrong or even corrupted/unusable. When the system does not deliver the value in the right time slot, it has produced a time failure. A combination of these two is called an arbitrary failure. Note that while a time failure intuitively covers delayed data, undelivered data (delayed infinitely) and even information that is delivered too early are also considered time failures.
The book also gives an extensive list of different ways a system might fail. These are given in a bullet list on page 32, and cover everything from complete chaos (uncontrollable failure) to an immune system (never fail), via controlled shutdown and reduced-operation failures. Fault prevention is an approach that aims to reduce the failures in a system by reducing the number of faults. The two main ways to do this are pretty intuitive: you can either find and remove the faults, or you can prevent their introduction in the first place. These are known as fault removal and fault avoidance. While there are hardware aspects to fault avoidance, such as using components that are suited for the task at hand, the most important is to write software with as few faults as possible. There are several ways to do this. Design methods, for example UML, and programming languages designed for real-time systems, such as Ada, will help reduce the number of faults. Extensive testing and simulation of the system will also help reduce the number of faults present during operation.
Fault tolerance and redundancy
Even though all precautions might have been taken during design, there will, in practice, still be faults in the system when it goes live. No software programmer writes 100% correct code all the time. No design specification is complete, consistent and unambiguous. Testing can only show the presence of faults, not their absence. Thus a system designer has to make sure the system continues to operate when a fault is activated. This is known as fault tolerance, and the curriculum describes three different levels (full fault tolerance, graceful degradation and fail safe).
In order to achieve this, system designers introduce extra elements to detect and handle faults. This is called redundancy, and it is possible to implement in both software and hardware. We will focus on the software part, where N-version programming and recovery blocks are the most common techniques.
N-version programming
N-version programming relies on several pieces of software computing the same result. There are several degrees of diversity, ranging from using different compilers for the same program, to letting several teams build the same functionality in different programming languages. During execution, a driver process starts the programs, receives the results and acts depending on the output. If the programs do not produce the same result, the driver might choose the most common result, ask the processes to compute it again, or simply terminate the faulty process. One of the downsides to this solution is that, in a real-time system, the driver process and the different programs might need to communicate between them, introducing a communication module as well. Add this to the fact that software is by far the most extensive and time-consuming part of a real-time system, and the project costs might double by introducing a redundant implementation. There are also increased strains on hardware, or even a need for separate hardware, which increases the cost and/or complexity of the system. As a final note, recall that most software faults originate from the specification; thus all N versions might suffer from the same fault.
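A minimal sketch of what the driver's voting step could look like for a 3-version system (the integer result type and the retry policy are assumptions for illustration, not the book's design):
#include <stdbool.h>

/* Majority vote over three independently computed results.
 * Returns true and stores the agreed value if at least two versions agree. */
bool vote3(int r1, int r2, int r3, int *agreed) {
    if (r1 == r2 || r1 == r3) { *agreed = r1; return true; }
    if (r2 == r3)             { *agreed = r2; return true; }
    return false;   /* no majority: the driver must retry or terminate versions */
}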
Dynamic redundancy
Note that N-version programming is a case of static redundancy: all N versions of the program run continuously whether or not the system fails. An alternative to this is to have modules that sit in the background and run only when an error is detected. This can decrease resource usage while still handling errors properly. This type of redundancy has four stages.
Error detection
Since the recovery system is not continually running, there needs to be some way to detect a fault. There are many triggers that could be used.
 • Replication checks are an extension to N-version programming that starts a recovery routine if one or more of the N results are wrong. Also consider a 2-version system: we cannot know which result is correct, but we know something is wrong, and action has to be taken.
 • Timing checks take the form of a watchdog timer that each process has to reset within a set interval. Note that timing checks can only be used to check for time failures; other error detection mechanisms have to be used for value failures.
 • A reversal check can be used if there is a one-to-one relationship between the input and output. Given an output, we can do a reversal calculation and compare it to the original input.
 • Coding checks rely on extra information sent along with the data, which can be compared on the receiving side. This is often used in communication. For example, imagine an integer is being sent on the network. A checksum might be the square of the number, which is calculated on the sending side. When both numbers are received, the check is recomputed and compared: a mismatch is proof of error. (A minimal sketch follows after this list.)
 • Reasonableness checks rely on knowledge of the system. This includes range checks, such as date and time validation.
 • Structural checks are used to ensure the data structures are correct. We know the number of items in a list; if this changes when it is not supposed to, something is wrong. Go users might recall that the json.Unmarshal function produces an error when the information it has received does not fit the target struct.
 • Dynamic reasonableness checks compare the current output with the previous output. Imagine a system of the form x′ = ax whose output suddenly changes from a large positive value to a small negative value: this jump should not occur, and something is wrong.
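As a rough illustration of the coding check just described (the message layout and the squared-value checksum are only assumptions carried over from the example above):
#include <stdbool.h>
#include <stdint.h>

/* A value is transmitted together with its square as a checksum. */
typedef struct {
    int32_t value;
    int64_t check;      /* expected to equal value * value */
} coded_msg_t;

coded_msg_t encode(int32_t value) {
    coded_msg_t m = { value, (int64_t)value * value };
    return m;
}

/* Receiving side: recompute the square and compare; a mismatch is proof of error. */
bool coding_check_ok(coded_msg_t m) {
    return m.check == (int64_t)m.value * m.value;
}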
Damage confinement
Normally, an error is not detected the instant it occurs. Thus, when an error is detected, the system has to figure out how far it has spread. Damage confinement concerns how the rest of the system is affected by an error. The two main techniques for doing this are modular decomposition and atomic actions. Modular decomposition emphasizes breaking down the system into components where each component only communicates with the others through well-defined interfaces. The internal details are hidden from other modules. If done properly, all errors are detected before another module needs the result. Atomic actions will be described in a later chapter.
Error recovery
When the error is found and we know its extent, the recovery process can start. There are two methods of error recovery: forward and backward recovery. Forward recovery involves continuing as before with selected corrections. While this is an efficient method of error recovery, it requires a good understanding of what the error is and how it can be remedied, or else the corrections might only make things worse. Backward recovery steps back to a previous state, called a recovery point, and tries to accomplish the same action with a different method. This is straightforward when there are no threads, but it quickly gets complicated with multiple threads, because of communication between them: if we need to roll one thread back to an earlier recovery point, we might have to undo a message sent to another thread. If this is the case, the other thread might have to roll back as well, requiring messages sent to yet other threads to be undone. Thus, because one thread has had an error, multiple other threads may have to roll back too. This is called the domino effect.
Fault treatment
While error recovery remedies the error, it does not remove the underlying fault. This fault might be easy to identify and/or remedy, so that the experienced error will not happen again. Both hardware and software faults are relatively easy to fix once identified if the system is allowed to be brought offline. If the program has to be modified while executing, things get a little more complicated.
Recovery blocks
Recovery blocks are an implementation of dynamic redundancy applied to software. Each entry to a software block is defined as a recovery point, and an acceptance test is run before the block can exit. The method is summarized in Figure 2. Note that the acceptance test can involve several of the methods discussed for error detection. For a short summary of the difference between N-version programming and recovery blocks, read chapter 2.7.
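One way the idea could be expressed in C, assuming the block's state can be checkpointed by copying it (the state type, the alternates and the acceptance test are all hypothetical):
#include <stdbool.h>

typedef struct { double estimate; int step; } state_t;    /* assumed block state */

typedef bool (*alternate_fn)(state_t *);                   /* primary, alternative, ... */

/* Try each alternate in turn; restore the recovery point before every attempt
 * (backward error recovery) and accept the first result that passes the test. */
bool recovery_block(state_t *state,
                    alternate_fn alternates[], int n,
                    bool (*acceptance_test)(const state_t *)) {
    state_t recovery_point = *state;          /* checkpoint on entry to the block */
    for (int i = 0; i < n; i++) {
        *state = recovery_point;              /* roll back before each attempt */
        if (alternates[i](state) && acceptance_test(state))
            return true;                      /* acceptance test passed */
    }
    *state = recovery_point;
    return false;                             /* failure: signal the enclosing level */
}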
Measuring reliability
Hardware reliability is pretty simple to measure: take a bunch of similar components, test them, and establish a measure of reliability from the proportion of components that fail. Software reliability, on the other hand, is much harder to measure, as code only comes from one source and does not deteriorate over time. Thus new measurements of reliability have to be established. At first, the notion was that software was either working or not working. As discussed, software can be designed to maintain only critical functionality when shit hits the fan, so this boolean interpretation of software reliability is not accurate enough. A more recent definition of software reliability is based on probability. More specifically, it is the probability that a given program will operate correctly in a specified environment for a specified length of time. One main way to calculate this is to use a growth model, increasing the reliability estimate for every fault that is corrected in the system. Another method is to estimate the probability statistically by applying a number of test cases to the system and, without correcting any faults, determining the fraction of correct cases.
Exceptions and exception handling
There are five general requirements for an exception-handling facility:
 • As with all language features, the facility must be simple to understand and use.
 • The code for exception-handling should not be so obtrusive as to obscure understanding of the program's normal error-free operation. A mechanism which intermingles code for normal processing and exceptional processing will prove difficult to understand and maintain. It may well lead to a less reliable system.
 • The mechanism should be designed so that run-time overheads are incurred only when handling an exception. Although the majority of applications require that the performance of a program which uses exceptions is not adversely affected under normal operating conditions, this may not always be the case. Under some circumstances, in particular where speed of recovery is of prime importance, an application may be prepared to tolerate a little overhead on the normal error-free operation.
 • The mechanism should allow the uniform treatment of exceptions detected both by the environment and by the program. For example, an exception such as arithmetic overflow, which is detected by the hardware, should be handled in exactly the same manner as an exception raised by the program as a result of an assertion failure.
 • The exception mechanism should allow recovery actions to be programmed.
Exception handling in older real-time languages
Unusual return value is one of the simplest methods for exception handling, and works as follows:
if (function_call(parameters) == AN_ERROR) {
    /* error handling code */
} else {
    /* normal return code */
}
If a function already returns a value and it is not possible to partition the range of values to indicate an error, then status flags are used. These are atomic shared variables which can be set and tested. With a forced branch, the instruction immediately following the subroutine call is skipped to indicate the presence (or absence) of an error. This is achieved by the subroutine incrementing its return address (program counter) by the length of a simple jump instruction to indicate an error-free (or error) return.
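A small sketch of the status-flag idea in C (the flag, the routine and its failure condition are invented for illustration):
#include <stdbool.h>

static volatile bool io_error = false;   /* shared status flag, set by the callee */

int read_sensor(void) {
    int raw = -1;
    /* ... device access would go here ... */
    if (raw < 0) {
        io_error = true;                 /* report the error through the flag */
        return 0;
    }
    return raw;
}

void caller(void) {
    io_error = false;                    /* clear, call, then test the flag */
    int value = read_sensor();
    if (io_error) {
        /* error handling code */
    } else {
        /* normal processing using value */
        (void)value;
    }
}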
Modern exception handling
While most of the older real-time languages have error handling that changes the flow of execution in the program, most modern languages have exception handling built directly into the language. There are four classes of exceptions.
 • Detected by the environment and raised synchronously - an array bounds violation or divide by zero are examples of such exceptions.
 • Detected by the application and raised synchronously - for example, the failure of a program-defined assertion check.
 • Detected by the environment and raised asynchronously - an exception raised as the result of power failure or the failure of some health-monitoring mechanism.
 • Detected by the application and raised asynchronously - for example, one process may recognize that an error condition has occurred that will result in another process not meeting its deadline or not terminating correctly.
Chapter 3 mainly focuses on synchronous exception handling. Ada requires exceptions to be declared as constants, while Java and C++ have a more object-oriented view.
Exception propagation
If an exception is raised, there is not always a handler associated with that block to take care of it. There are two possible methods for dealing with a situation where no immediate exception handler can be found. The first approach is to regard the absence of a handler as a programmer error which should be reported at compile-time. This can be challenging, as it is not always possible for the compiler to check whether the calling context includes the appropriate exception handlers. The second approach is to look for handlers up the chain of invokers at run-time; this is called propagating the exception. This is the approach taken in Ada, Java and C++. If an exception is propagated outside its scope, it may be impossible to find a handler. Most languages have a 'catch all' exception handler to solve this potential problem.
Exception handling in Ada, Java and C
See textbook for examples.
Ada
Ada supports explicit exception declaration in the same fashion as constants, with the keyword exception. Exceptions can be declared in the same places as any other declaration and have the same scope. Some standard exceptions, like Constraint_Error and Storage_Error, have the whole program as scope. Every block in Ada can contain an optional collection of exception handlers declared at the end of the block. They have the following syntax: when <exception_name> => <action>; see the textbook for more details. when others can be used as the last exception-handling choice to pick up all exceptions not listed previously. In Ada, an exception which is raised and not handled by a subprogram is propagated to the caller of the subprogram. Therefore, such an exception will only be handled by the initialization code if it itself called the subprogram. See textbook (p.73) for an example. The exception can also be propagated by a program re-raising the exception in the local handler. This facility is useful in the programming of last wishes. See textbook (p.74) for an example. Exceptions will never fully meet requirement 3, as detecting possible error conditions will require some resources. The Ada language therefore provides a facility for suppressing the run-time checks that raise the standard exceptions, should they become too costly for a particular application. Difficulties with the Ada model of exceptions:
 • It can be hard to keep track of where an exception is raised.
 • Ada does not allow a full range of parameters to be passed to handlers, only a character string. This can be inconvenient if an object of a particular type needs to be passed.
 • It is possible for exceptions to be propagated outside of the scope of their declaration. Such exceptions can only be trapped by when others. However, they may come back into scope again when propagated further up the dynamic chain. This is disconcerting, although probably inevitable when using a block-structured language and exception propagation.
Java
Java supports a termination model of exception handling, like Ada, but the exceptions are integrated into the object-oriented model. Raising an exception in Java is called throwing the exception, and it can only be handled from within a try-block. Each handler is specified using a catch statement, which works like a function declaration. A handler with parameter type T will catch a thrown object of type E if:
 • T and E are the same type; or
 • T is a parent (super) class of E at the throw point.
As with Ada, if no handler is found in the calling context of a function, the calling context is terminated and a handler is sought in a try-block within its calling context. Java supports a finally clause as part of a try-block. The code attached to this clause is guaranteed to execute whatever happens in the try-block, irrespective of whether exceptions are thrown, caught or propagated, or, indeed, even if no exceptions are thrown at all. To illustrate this, consider the following example, which will return false:
try {
    return true;    // the try block finishes with return true ...
} finally {
    return false;   // ... but the finally clause always runs, and its return wins
}
C
C does not define any exception-handling facilities within the language. This limits the possibilities of programming reliable systems, but it is possible to provide some form of exception-handling mechanism by using the macro facility of the language. To implement a termination model in C, it is necessary to save the status of a program's registers, and so on, on entry to an exception domain and then restore them if an exception occurs. This can be done using setjmp and longjmp.
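A minimal sketch of that idea (the exception domain, the error code and the 'device' routine are invented for illustration):
#include <setjmp.h>
#include <stdio.h>

static jmp_buf exception_context;        /* saved context for the exception domain */

#define SENSOR_FAILURE 1                 /* assumed application-defined error code */

static void read_device(int broken) {
    if (broken)
        longjmp(exception_context, SENSOR_FAILURE);   /* "raise" the exception */
    /* normal processing */
}

int main(void) {
    int code = setjmp(exception_context);   /* 0 on entry, the code after a longjmp */
    if (code == 0) {
        read_device(1);
        puts("no exception raised");
    } else {
        printf("handled exception %d\n", code);   /* the domain was terminated */
    }
    return 0;
}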
Summary
Although many different models exist, they all address the following issues:
 • Exception representation - an exception may, or may not, be explicitly represented in a language.
 • The domain of an exception handler - associated with each handler is a domain which specifies the region of computation during which, if an exception occurs, the handler will be activated. The domain is normally associated with a block, subprogram or statement.
 • Exception propagation - this is closely related to the idea of an exception domain. It is possible that when an exception is raised there is no exception handler in the enclosing domain.
 • Resumption or termination model - this determines the action to be taken after an exception has been handled.
 • Parameter passing to the handler - this may or may not be allowed. The table below shows the exception-handling facilities of various languages.
Language   Domain      Propagation   Mode          Parameters
Ada        Block       Yes           Termination   Limited
Java       Block       Yes           Termination   Yes
C++        Block       Yes           Termination   Yes
CHILL      Statement   No            Termination   No
CLU        Statement   No            Termination   Yes
Mesa       Block       Yes           Hybrid        Yes
It is not unanimously accepted that exception-handling facilities should be provided in a language. To sceptics, an exception is a goto where the destination is indeterminable and the source is unknown. They can, therefore, be considered to be the antithesis of structured programming.
Concurrent programming
Introduction
Definition: Concurrent programming is the name given to techniques for expressing potential parallelism and solving the resulting synchronization and communication challenges. Motivation behind concurrent programming:
 • Real-world systems (robots etc.) have a parallel nature.
 • Modern processors are much faster than the I/O devices with which they interact, and a sequential program that is waiting for I/O is unable to perform other operations.
 • To allow more than one processor to solve a problem.
 • Without concurrency, the software must be constructed as a single control loop. This structure cannot retain the logical distinction between system components.
Speed-up from parallelizing: Amdahl's law gives that max speed-up = 1 / ((1 - P) + P/N), where P is the proportion of code that benefits from parallelization and N is the number of available processors. For example, with P = 0.9 and N = 4, the maximum speed-up is 1 / (0.1 + 0.225) ≈ 3.1.
Terms and definitions
Task/thread/process
A task/thread is the unit of parallelism: a single thread of control. A process is one or more threads working within their own shared memory context.
Fundamental facilities
All concurrent programming relies on these fundamental facilities:
 1. Activities (threads, tasks or processes).
 2. Synchronization mechanisms.
 3. Support of communication between concurrent activities.
Types of interaction
Interaction between tasks can be grouped as one of the following:
 • Independent: no synchronization or communication between the tasks.
 • Cooperating: the tasks synchronize and communicate.
 • Competing: the tasks compete for their share of resources. Although they communicate and synchronize to obtain resources, they are essentially independent.
Hierarchy
Based on hierarchy and relationships, one often distinguishes between the following task roles:
 • Parent/child: the parent is responsible for the creation of another task, the child. The parent may be delayed while the child is being created and initialized.
 • Guardian/dependant: a task may be dependent on a guardian task. The guardian is not allowed to terminate until all its dependent tasks have terminated.
 • Sometimes the parent is also the guardian, but not necessarily so with dynamic tasks.
Concurrent execution
Variation between the languages
The models of concurrency in programming languages vary, especially when it comes to:
 • Structure: The number of tasks can be either:
   ◦ Static: fixed at compile-time
   ◦ Dynamic: tasks can be created at any time
 • The level of parallelism: Tasks can be either:
   ◦ Nested: allowed to be defined within other tasks
   ◦ Flat: not nested
 • Granularity (level of detail): Within languages that support nested constructs we distinguish between:
   ◦ Coarse-grain parallelism: the program contains few tasks, each with a long life. (Applies to most concurrent programming languages, typified by Ada.)
   ◦ Fine-grain parallelism: the program contains a large number of simple tasks, some of which exist only for a single action (e.g. Occam2).
 • Initialization: The tasks can be supplied with information by:
   ◦ Passing it as parameters (most modern languages allow this)
   ◦ Communicating explicitly after the task has commenced its execution.
 • Termination: Can happen in a variety of ways:
   ◦ Normal completion of execution of the task's body
   ◦ Suicide, by a self-terminate statement
   ◦ Abortion, through the action of another task
   ◦ An untrapped error condition
   ◦ Never (when executing non-terminating loops)
   ◦ When no longer needed
Task representation
The three basic mechanisms for representing concurrent execution are: fork and join, cobegin, and explicit task declaration. The latter is the most used in modern real-time languages.
Fork and join
The fork statement specifies that a designated routine should start executing concurrently with the invoker of the fork. The join statement allows the invoker to synchronize with the completion of the invoked routine. For example:
A procedure P (parent) begins and forks a function F (child). They will now execute concurrently until the point where P states join. Here P will wait for F to finish (if it hasn't already done so). Fork and join allow for dynamic task creation and provide a means of passing information to the child via parameters. They can, however, be error prone in use; for example, in some systems a guardian must explicitly rejoin all dependants rather than merely wait for their completion.
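As an illustration, POSIX's process-level fork/wait is one concrete realization of the pattern (the child routine is a hypothetical placeholder):
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

void child_routine(void);              /* hypothetical designated routine */

int main(void) {
    pid_t pid = fork();                /* "fork": create the concurrent activity */
    if (pid == 0) {
        child_routine();               /* child executes the designated routine */
        exit(EXIT_SUCCESS);
    }
    /* the parent continues concurrently with the child here ... */
    waitpid(pid, NULL, 0);             /* "join": wait for the child to complete */
    return 0;
}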
Cobegin
The cobegin denotes the concurrent execution of a collection of statements, S1 to Sn. The cobegin terminates (the equivalent of join in fork and join) when all the statements have terminated. Data can be passed to the invoked tasks via the parameters of procedure calls. A cobegin statement can even include a collection of statements that itself has a cobegin within it.
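C has no cobegin, but as a rough sketch, OpenMP's parallel sections (compile with -fopenmp) give a similar structure: the statements run concurrently and the construct only finishes when all of them have terminated. S1..S3 are hypothetical statements.
void S1(void);
void S2(void);
void S3(void);

void cobegin_example(void) {
    #pragma omp parallel sections      /* "cobegin" */
    {
        #pragma omp section
        S1();
        #pragma omp section
        S2();
        #pragma omp section
        S3();
    }                                  /* implicit join: all sections have finished */
}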
Explicit task declaration
With explicit task declaration, routines state in their declaration that they are to be executed concurrently. This has become the standard way of expressing concurrency in real-time languages.
Language parallelism
Ada
The unit of parallelism is called a task in Ada. Tasks consist of a specification and a body. They can be passed initialization data upon creation, but only of discrete types and access (pointer) types. Tasks can be given as task types. Tasks that do not have types declared for them and have no pointers to them are called anonymous.
Dynamic tasks can be created by giving a non-static value to the bounds of an array (of tasks) or by using the 'new' allocator on an access type (of a task type). Tasks created by the 'new' allocator have the important property that the guardian (called master in Ada) is not the block in which the task was created, but the one that contains the declaration of the access type. That means that if task Y is created inside an inner block, but its access type was declared outside, the inner block can terminate even though Y hasn't yet terminated. Termination: a task will terminate when it completes its body, when there is an unhandled exception, when it is no longer needed (via a select statement with a terminate alternative), or when it is aborted. Any task can abort any other task, and as a result, all its dependants will also be aborted. This is a dangerous action, and should be used only when no other alternative actions are possible.
Java
Java has a predefined class, java.lang.Thread, which provides threads. Java also has an interface, called Runnable, to express concurrent execution. This interface provides a run method.
There are two ways to create threads. One way is to declare a class to be a subclass of Thread and override the run method. The second way is to declare a class that implements the Runnable interface. Common to all threads in Java is that they are not created automatically when their associated objects are created, but must be explicitly created and started using the start method (e.g. thread1.start();).
Java allows dynamic thread creation like Ada, but unlike Ada, Java allows arbitrary data to be passed as parameters. Java has no guardian concept, and relies on garbage collection to clean up objects that are no longer accessible, but the program as a whole still terminates only when all its (non-daemon) threads have terminated. The join method works as described in "Fork and join", and the isAlive method tells whether a target thread has terminated.
Daemon threads provide general services and typically never terminate until they are no longer needed (like Ada tasks using the select statement). In Java, one no longer has the same opportunity to abort threads as in Ada.
C/Real-Time POSIX
C/Real-Time POSIX provides three mechanisms for creating concurrent activities. The first is the fork and wait mechanism. The second is spawn, which is a combination of fork and exec. The third is that a process may contain several threads, as in Ada and Java. A thread starts executing when it is created by pthread_create; this is also where its parameters are passed.
Termination: a thread can terminate normally by returning from its start_routine or by calling pthread_exit. One thread can wait for another to terminate with the pthread_join function. Cleaning up and reclaiming the storage of a thread is called detaching.
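A minimal sketch of these calls (the worker routine and its argument are invented for illustration):
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg) {
    int id = *(int *)arg;                      /* parameters arrive via the void* */
    printf("worker %d running\n", id);
    return NULL;                               /* pthread_exit(NULL) would be equivalent */
}

int main(void) {
    pthread_t tid;
    int id = 1;
    pthread_create(&tid, NULL, worker, &id);   /* create and start the thread */
    pthread_join(tid, NULL);                   /* wait for it to terminate */
    return 0;
}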
Multiprocessor and distributed systems
Definitions:
Global dispatching means that a task can be given to any available processor. From a real-time perspective, fixing tasks to certain processors will usually result in a more predictable response. Linux allows threads to be constrained to execute on a limited set of processors; this is called processor affinity. Although Ada, Real-Time Java and C/Real-Time POSIX all support multiprocessor implementations, they do not provide mechanisms to set the processor affinity of a task.
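On Linux specifically, affinity can be set through a non-portable extension; a small sketch (pthread_setaffinity_np is a GNU extension, not standard POSIX):
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Constrain the calling thread to CPU 0. */
int pin_to_cpu0(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}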
Steps to make a distributed system
Production of an application for a distributed system involves some steps not required when programs are produced for a single or multiprocessor platform.
Language-supported vs operating-system-supported concurrency
There has been a long debate as to whether the support for concurrency should be provided by the language or by the operating system. Pros and cons for the former follow.
Pros:
 • More readable and maintainable programs.
 • There are many different operating systems.
 • An embedded computer may not have any resident operating system.
Cons:
 • Different languages have different models of concurrency.
 • It may be difficult to implement a language's model of concurrency efficiently on top of an operating system's model.
 • Operating system API standards, such as POSIX, have emerged, and therefore programs are more portable.
Shared variable-based synchronization and communication
The relevant learning goal for this chapter is the ability to use (correctly) and evaluate mechanisms for shared variable synchronization. Discussions about issues when using multi-core architectures are left out. In short, a correct program will work on both single-core and multi-core shared memory architectures. However, errors may go undetected on a single-core system due to the sequential nature of a single-core environment. Problems with integrating concurrency with object-oriented programming (such as in Java) are also not covered.
Introduction
The correct behaviour of a concurrent program is critically dependent on synchronization and communication between tasks. The classical errors data race, deadlock, livelock and resource starvation are a consequence of incorrect synchronization and/or communication between concurrent tasks. There are two main ways of performing task communication, either with shared variables or with message passing. In this section, shared variables are discussed. An example of message-based communication combined with synchronization is Google Go's unbuffered channels. To introduce some important concepts, consider two tasks with a shared integer X. Both tasks wish to update the value of this integer with the statement X := X + 1. The problem with this statement is that it is not executed as an indivisible (atomic) operation, but is in fact split into instructions that can become interleaved. We would like some way to assure that only one of the tasks executes this statement at any given time. We call a sequence of statements that we require to execute indivisibly a critical section. The synchronization required to protect a critical section is called mutual exclusion.
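To see the interleaving concretely, the unprotected C sketch below usually prints less than 200000, because the two threads' load/add/store sequences interleave (the counts and thread structure are chosen only for illustration):
#include <pthread.h>
#include <stdio.h>

static long X = 0;                     /* shared variable */

static void *incrementer(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++)
        X = X + 1;                     /* the critical section, here unprotected */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, incrementer, NULL);
    pthread_create(&t2, NULL, incrementer, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("X = %ld (expected 200000)\n", X);
    return 0;
}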
Signals and busy waiting
The simplest way for two tasks to communicate is with a shared signal variable. One of the tasks continuously reads the shared variable until another task sets this signal variable to some value. The first task is then allowed to continue, since a synchronization has occurred. The main problem with this is that the first task has to continuously poll the signal; this is called busy waiting and is very resource demanding. In addition, implementing higher-level features like mutual exclusion is complicated and hard to prove correct (especially when more than two tasks are involved). With busy waiting, a state called livelock can occur. This happens when a task is busy waiting and does not receive a signal in finite time. The task then cannot continue to execute and, in addition, keeps spending resources on busy waiting.
Semaphores
The most basic building block in shared variable synchronization is the semaphore. It is therefore important to understand what it does, how it can be implemented, its limitations, and why it works. An important feature of the semaphore is that it eliminates the need for busy waiting. The semaphore is conceptually a shared non-negative integer equipped with two methods, wait and signal. It is imperative that the implementations of these methods are atomic, i.e. two tasks executing wait at the same time will not interfere with each other. The two methods work as follows: wait decrements the value if it is greater than zero, and otherwise suspends the calling task until it can do so; signal increments the value, releasing a suspended task if one is waiting.
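With POSIX counting semaphores, the earlier increment example can be protected like this (sem_wait plays the role of wait, sem_post of signal; counts and structure are again illustrative):
#include <pthread.h>
#include <semaphore.h>

static sem_t mutex;                    /* semaphore initialized to 1 */
static long X = 0;

static void *incrementer(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        sem_wait(&mutex);              /* wait: suspends (no busy waiting) if 0 */
        X = X + 1;                     /* critical section */
        sem_post(&mutex);              /* signal: releases one waiting task */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    sem_init(&mutex, 0, 1);            /* 0 = shared between threads, initial value 1 */
    pthread_create(&t1, NULL, incrementer, NULL);
    pthread_create(&t2, NULL, incrementer, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    sem_destroy(&mutex);
    return 0;
}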
Error conditions and liveness
The error condition livelock was explained in section 5.1. Another error condition introduced by synchronization mechanisms is the deadlock. This is a state where a set of tasks cannot proceed due to conflicting synchronization demands. Consider the following two tasks, with two synchronization semaphores S1 and S2.
326// Task 1
327wait(S1);
328 wait(S2);
329 ...
330 <critical section>
331 ...
332 signal(S2);
333signal(S1);
334and
335// Task 2
336wait(S2);
337 wait(S1);
338 ...
339 <critical section>
340 ...
341 signal(S1);
342signal(S2);
In this example, we risk interleaving the wait calls such that Task 1 is blocked at the call to wait(S2), while Task 2 is blocked at wait(S1), and neither of them can continue their operation. This is called a deadlock. The third error condition, which is less severe, is called starvation. This is when a task is not given access to a needed resource in finite time. If it can be proven that a task is free from livelocks, deadlocks and starvation, we say that it possesses the liveness property.
Conditional critical regions
The semaphore is simple, but it is error prone when implementing complex synchronization between multiple tasks. To resolve some of these issues, we introduce a language feature called Conditional Critical Regions (CCRs). This language feature is designed to implement mutual exclusion in critical regions. The idea is to group shared data and label it as a resource. Whenever we have a block of code that needs to use this resource, we label it as a region using this resource, with a conditional guard. The CCR feature ensures that no two regions associated with the same resource execute simultaneously (thus giving mutual exclusion). The following example illustrates how a language could include CCRs:
// Pseudo-code to illustrate Conditional critical regions
348shared_var_t := Integer ;
349SHARED_VAR_MAX : Integer := 50;
350resource shared_var : shared_var_t := 0;
351
352task P1;
353 loop
354 region when shared_var < SHARED_VAR_MAX do
355 shared_var := shared_var + 1;
356 end region ;
357 end loop ;
358
359task P2;
360 loop
361 region when shared_var < SHARED_VAR_MAX do
362 shared_var := shared_var + 2;
363 end region ;
364 end loop ;
365
366task P3;
367 loop
368 region when shared_var >= SHARED_VAR_MAX do
369 shared_var := 0; // reset the counting
370 end region ;
371 end loop ;
We see that we introduce two new keywords, resource and region, to use CCRs in our language.
Monitors
The main problem with CCRs is that the blocks of code referring to a shared resource can be dispersed throughout the program. Monitors are introduced as a way to structure resources and procedures together. A monitor has a set of variables, and a set of procedures operating on these shared variables. Each call to a procedure is protected by mutual exclusion provided by the programming language.
The main problem with the monitor is that it still relies on some sort of internal conditional synchronization. This can be implemented with a semaphore-like structure called a condition variable. The similarities with the semaphore bring along the weaknesses of the semaphore; the condition variable is error prone and can be complicated to use.
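A sketch of a monitor built by hand with POSIX primitives: the mutex supplies the mutual exclusion a monitor-aware language would give for free, and the condition variables provide the internal conditional synchronization discussed above (the one-slot buffer is an assumed example, not from the book):
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t monitor_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full     = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_empty    = PTHREAD_COND_INITIALIZER;
static int  slot;                      /* the monitor's shared state */
static bool full = false;

void monitor_put(int item) {
    pthread_mutex_lock(&monitor_lock);                /* enter the monitor */
    while (full)
        pthread_cond_wait(&not_full, &monitor_lock);  /* waiting releases the lock */
    slot = item;
    full = true;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&monitor_lock);              /* leave the monitor */
}

int monitor_get(void) {
    pthread_mutex_lock(&monitor_lock);
    while (!full)
        pthread_cond_wait(&not_empty, &monitor_lock);
    int item = slot;
    full = false;
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&monitor_lock);
    return item;
}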
Protected objects
In Ada, a language feature called protected objects improves on the monitor by using guards (barriers) instead of condition variables for the entries that need conditional synchronization. If an entry in a protected object has a guard, the guard is evaluated when a task attempts to execute the entry. If the guard evaluates to False, the task is suspended. The guards are re-evaluated every time a task exits a procedure or entry of the protected object.
Message-Based Synchronization and Communication
Message passing is introduced as an alternative to shared variable synchronization and communication. This approach is typified by the use of a single construct for both communication and synchronization. There is a wide variety in the semantics of message passing; because of this, three major issues are considered and discussed:
 • The model of synchronization
 • The method of task naming
 • The message structure
Process Synchronization
In a message-based system a receiver, obviously, cannot read a message before it is sent. This is opposed to shared variable synchronization, where a receiver will not know whether the variable has been written to or not. The task synchronization of the sender in a message-based system can be broadly classified into three categories:
 • Asynchronous - the sender proceeds immediately after sending the message
 • Synchronous - the sender waits for the message to be received
 • Remote invocation - the sender proceeds only when a reply has been returned
These three forms are related, as two asynchronous sends can constitute a synchronous send (see Ex. 2). Likewise, two synchronous (or asynchronous) sends can constitute a remote invocation (see Ex. 3).
Using asynchronous sends for synchronized messaging:
Task 1:                      Task 2:
asyn_send(message)           wait(message)
wait(acknowledgement)        asyn_send(acknowledgement)
Using synchronous sends for remote invocation messaging:
Task 1:                      Task 2:
syn_send(message)            wait(message)
wait(reply)                  ...
                             construct reply
                             ...
                             syn_send(reply)
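As a concrete (Linux/POSIX) illustration of asynchronous send, a sketch with POSIX message queues; the queue name, sizes and payload are assumptions, and error handling is omitted for brevity:
#include <fcntl.h>
#include <mqueue.h>
#include <string.h>

/* Sender side: mq_send returns as soon as the message is buffered,
 * so the sender proceeds without waiting for the receiver. */
void async_send_example(void) {
    struct mq_attr attr = { .mq_flags = 0, .mq_maxmsg = 10,
                            .mq_msgsize = 64, .mq_curmsgs = 0 };
    mqd_t q = mq_open("/demo_queue", O_CREAT | O_WRONLY, 0600, &attr);
    const char *msg = "sensor reading";
    mq_send(q, msg, strlen(msg) + 1, 0);     /* asynchronous send */
    mq_close(q);
}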
As asynchronous send can be used to construct the other two, this model gives the most flexibility. Because of this, languages and operating systems should support this model. There are, however, some drawbacks to consider when using asynchronous send:
 • Infinite buffers might be needed to store messages that have not yet been read (perhaps because the receiver has terminated)
 • Asynchronous communication is out-of-date, as most sends are programmed to expect an acknowledgement (synchronous communication)
 • More communication is needed with the asynchronous model, hence increased complexity
 • It is more difficult to prove correctness of the complete system
Task Naming and Message Structure
In naming tasks there are two things to consider: direct vs. indirect naming, and symmetry. In a direct naming scheme the sender explicitly names the receiver:
send <message> to <task-name>
An indirect naming scheme, on the other hand, names an intermediary (known variously as a channel, mailbox, link or pipe):
send <message> to <mailbox>
A naming scheme is symmetric if both the sender and receiver name each other (directly or indirectly). It is asymmetric if the receiver does not name a specific source, hence accepting messages from any sender. Asymmetric naming fits the client-server paradigm, as a server can receive messages from, and do services for, any number of clients (tasks). Indirect asymmetric naming has some further issues, as the intermediary can have different structures:
 • many-to-one
 • many-to-many
 • one-to-one
 • one-to-many
Message Structure
A language should allow a data object of any type (pre- or user-defined) to be sent as a message. This can be difficult in practice, especially if the sender and receiver have different data representations.
Selective Waiting, Non-Determinism and Synchronization Primitives
Message passing has so far been assumed to make the receiver wait for a message. This is often too restrictive, as the receiver may want to wait for several different tasks to call. To facilitate this common need, receivers can be allowed to wait selectively for messages from different tasks.
Dijkstra's guarded command is a programming structure used to solve this. A guarded command is executed if its guard evaluates to true, and it is an element of a guarded command set. If several commands in a set evaluate to true, the choice of which is allowed to proceed should be arbitrary; that is, the construct is non-deterministic, as opposed to if-statements.
The reason for making this choice non-deterministic is that most concurrent languages make few assumptions about which tasks are executed when. The scheduler is likewise assumed to be non-deterministic as to when different tasks are scheduled. The same argument can be applied to any queue defined by a synchronization primitive: non-deterministic scheduling implies that a queue should release tasks non-deterministically.
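A minimal sketch in C of the guarded-command idea (the guard/action function pointers and the bound of 16 open alternatives are illustrative assumptions, not any standard API): all guards are evaluated and one open alternative is chosen arbitrarily.

#include <stdbool.h>
#include <stdlib.h>

typedef struct {
    bool (*guard)(void);   /* the alternative is open if this returns true */
    void (*action)(void);  /* executed if this alternative is selected     */
} alternative;

/* Evaluate all guards and execute one open alternative, chosen arbitrarily.
   Returns false if no guard is open. */
bool select_one(alternative *alts, int n)
{
    int open_idx[16];
    int count = 0;
    for (int i = 0; i < n && count < 16; i++)
        if (alts[i].guard())
            open_idx[count++] = i;
    if (count == 0)
        return false;                             /* no guard is true        */
    alts[open_idx[rand() % count]].action();      /* non-deterministic choice */
    return true;
}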
433Message Passing in Ada
The semantics of remote invocation have many superficial similarities with a procedure call in Ada: data is passed to a receiver, the receiver executes, and then data is returned. Because of this similarity, Ada supports the definition of a program's messages in a way that is compatible with the definition of procedures, protected subprograms and entries.
For a task to be able to receive a message, it must define an entry. Entries can be declared private, meaning they can only be called by tasks local to the task's body. Facilities are also provided for defining, in effect, an array of entries; these are called an entry family (see the multiplexor on p.197 for an example). Calling a task (sending a message) simply involves naming the receiving task and its entry directly. If an entry call is made on a task that is no longer active, the exception Tasking_Error is raised at the point of call.
In Ada, data can also be transferred in the direction opposite to the message itself. Because of the terminology confusion this can cause, Ada's message passing is usually called an 'extended rendezvous'.
Every entry should have an accept statement associated with it, and this statement must appear in the task's body. The naming in an accept statement is asymmetric, as it names the entry and not the task from which the message came. If multiple tasks call the same entry, they are ordered in a FIFO queue.
Exception Handling and the Rendezvous
Any valid Ada code can be executed during a rendezvous, so it is possible for an exception to be raised within the accept statement itself. If this occurs, the exception can either be handled appropriately inside the statement itself, or the statement terminates and the exception must be handled outside it. If the exception is not handled within the statement's scope, errors may occur.
442Message Passing via Task Interfaces
Ada's direct asymmetric naming means that tasks must be named explicitly. However, this is not always the appropriate behaviour, as a client may not be concerned about the task's type, only that it provides the appropriate service. Task interfaces are ideal for this purpose. A task type in Ada 2005 can be declared to 'implement' zero, one or more combinations of synchronized and task interfaces.
445Select statement
The select statement in Ada must contain at least one accept alternative (and may contain as many as desired), plus at most one of three further kinds of alternative:
447 • a terminate alternative
448 • an else alternative
449 • a delay alternative
The general form of a select statement is:
select
   when <Boolean-Expression> =>
      accept <entry> do ..
      end <entry>
      -- any sequence of statements
or
   -- similar
   ..
end select;
The terminate alternative terminates the task if no tasks that could call its entries remain. The else alternative is executed if no other alternative is immediately available. The delay alternative is executed if no entry call arrives before the specified delay expires. If multiple entry calls are waiting, the choice of which one is accepted is non-deterministic.
462C/Real-Time POSIX message queues
C/Real-Time POSIX supports asynchronous, indirect message passing through the notion of message queues. A message queue can have many readers and writers. A priority may be associated with each message (not part of the curriculum). Message queues are intended for communication between processes, but they can be used by threads as well. See Program 6.1, p. 207 for the C/Real-Time POSIX interface to message queues.
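A minimal C sketch of this interface (the queue name, attributes and message contents are illustrative assumptions; on Linux the program is typically linked with -lrt):

#include <mqueue.h>
#include <fcntl.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
    struct mq_attr attr = { .mq_maxmsg = 10, .mq_msgsize = 64 };

    /* Create (or open) the intermediary -- the mailbox both sides name. */
    mqd_t mq = mq_open("/demo_queue", O_CREAT | O_RDWR, 0600, &attr);
    if (mq == (mqd_t)-1) { perror("mq_open"); return 1; }

    /* Asynchronous send: the caller continues as soon as the message is queued. */
    const char *msg = "sensor reading";
    if (mq_send(mq, msg, strlen(msg) + 1, /* priority */ 0) == -1)
        perror("mq_send");

    /* A receiver (possibly another process) blocks here until a message arrives. */
    char buf[64];
    unsigned prio;
    if (mq_receive(mq, buf, sizeof buf, &prio) >= 0)
        printf("received: %s (priority %u)\n", buf, prio);

    mq_close(mq);
    mq_unlink("/demo_queue");
    return 0;
}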
465Distributed systems
In a distributed system where tasks operate on different nodes, the goal is to make the communication between them as easy and reliable as possible. In reality, complex communication protocols are needed. A set of mechanisms must be provided so that:
467 • tasks do not need to decode or encode messages.
468 • all messages received by a user task can be assumed to be intact, and in good condition.
 • messages received by a task are of the type the task expects.
470 • tasks are not restricted to communicate only in terms of predefined, built-in types.
471It is possible to identify three main standards for communication in a distributed system:
472 • externally-defined application programmer's interface (API), to access network transport protocols
473 • remote procedure call (RPC) paradigm
474 • distributed object paradigm
476Remote Procedure Call
RPC is typically used when communication is needed between programs written in the same language. A procedure (the server) is identified as one that can be called remotely. From this, two further procedures are created: a client stub and a server stub (see figure 6.1, p.211). The stubs are sometimes called middleware, as they sit between the application and the operating system. The client stub acts on the client's side, and the server stub on the server's side. The client stub is used to:
478 • identify the address of the server (stub) procedure
479 • convert parameters into bytes suitable for transmission across the network (parameter marshalling)
480 • send the request to execute the procedure to the server (stub)
481 • wait for a reply from the server (stub) and unmarshall the message
 • return control to the client procedure along with the returned message
The server stub is used to:
483 • receive requests from client (stub) procedures
484 • unmarshall the received message
485 • call the server
486 • catch any exceptions that are raised by the server
 • marshal and send return parameters and/or exceptions
If the clients and servers are written in different languages, the marshalling and unmarshalling must be machine- and language-independent.
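As a hedged illustration of what a client stub does (the transport primitive net_send, the server name and the two-integer wire format are hypothetical assumptions, not part of any real RPC library):

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>   /* htonl: host-to-network byte order */

/* Hypothetical transport primitive -- stands in for the network layer. */
extern void net_send(const char *server, const void *buf, size_t len);

void add_stub(int32_t a, int32_t b)
{
    unsigned char buf[8];
    uint32_t na = htonl((uint32_t)a);   /* parameter marshalling into a     */
    uint32_t nb = htonl((uint32_t)b);   /* machine-independent byte order   */
    memcpy(buf,     &na, 4);
    memcpy(buf + 4, &nb, 4);
    net_send("adder-server", buf, sizeof buf);  /* request execution on the server */
    /* ... wait for the reply, unmarshal it, and return it to the caller ... */
}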
489The Distributed Object Model
The distributed object model allows:
491 • dynamic creation of an object on a remote machine
492 • identification of an object to be determined and held on any machine
493 • transparent invocation of a method in an object as if it were a local method
494 • transparent run-time dispatching of a method call across the network
495Not all systems supporting this model provide mechanisms for all this functionality.
497Atomic actions
498This chapter is meant as a basic explanation of the concepts behind atomic actions. For specific implementations of atomic actions in C, Ada and Java, please see (Burns and Wellings, 2009), chapters 7.2, 7.5-7.7.
500Atomic actions
In concurrent programming, several tasks can run in parallel, each performing operations or actions. With concurrency, it is easy for tasks to interfere with each other when performing the same action. A solution to this problem is to require that such actions are run as indivisible, or atomic, actions. There are several ways of expressing the properties of atomic actions:
 1. An action is atomic if the tasks performing it are not aware of the existence of any other active task, and no other active task is aware of the activity of the tasks during the time they are performing the action.
503 2. An action is atomic if the tasks performing it do not communicate with other tasks while the action is being performed.
 3. An action is atomic if the tasks performing it can detect no state change except those performed by themselves, and if they do not reveal their state changes until the action is complete.
 4. Actions are atomic if they can be considered, so far as other tasks are concerned, to be indivisible and instantaneous, such that the effects on the system are as if they were interleaved as opposed to concurrent.
The essence of these statements is that the system always sees a consistent state, and that all actions performed appear to be instantaneous. These concepts are usually not supported natively by programming languages, so the functionality must be implemented using other, more primitive concepts. An atomic action, even though viewed as indivisible, can be composed of internal atomic actions. This structure is called nested atomic actions. To fulfil the stated requirements for atomic actions, it is important that only tasks involved in the outer-level action take part in the internal (nested) actions.
508Two-phase atomic actions
For an action to be completely atomic, no communication with other parts of the system is allowed. This is, however, a problem when it comes to acquiring resources, because the task must then communicate with a resource manager such as a server task. This leaves two options. The first is to make the server task, which serves all other tasks as well, part of the atomic action. This is very inefficient, as it serializes all tasks requesting data from the server and removes the concurrency. The second option, which in reality is the only one, is to let the atomic action communicate freely with resource servers. For efficiency and increased concurrency, it is advantageous if tasks wait as long as possible before acquiring resources, and free them as soon as possible. This could, however, make it possible for other tasks to detect a change in state before the original task has finished performing the atomic action. To avoid this, atomic actions are split into two separate phases: a 'growing' phase where only acquisitions can be made, and a 'shrinking' phase where resources can be released but no new acquisitions can be made. This way, other tasks always see a consistent system state.
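A minimal sketch of the two-phase discipline, assuming two illustrative resources guarded by POSIX mutexes (the resource names and the work performed are placeholders):

#include <pthread.h>

static pthread_mutex_t resource_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t resource_b = PTHREAD_MUTEX_INITIALIZER;

void atomic_action(void)
{
    /* Growing phase: acquisitions only, made as late as possible. */
    pthread_mutex_lock(&resource_a);
    /* ... work that needs only resource A ... */
    pthread_mutex_lock(&resource_b);
    /* ... work that needs both resources ... */

    /* Shrinking phase: releases only, no further acquisitions. */
    pthread_mutex_unlock(&resource_b);
    pthread_mutex_unlock(&resource_a);
    /* No resource is re-acquired after a release, so other tasks
       always observe a consistent state. */
}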
511Atomic transactions
Atomic transactions are like regular atomic actions, with one additional feature: after execution, the transaction is defined as either a success or a failure. If an atomic action fails, the system may be left in an inconsistent state; if an atomic transaction fails, the system is returned to the state prior to its execution. For this reason atomic transactions are also called recoverable actions. There are two distinct properties of atomic transactions:
513Even though transactions provide error detection, they are not suitable for fault-tolerant systems. This is because they provide no recovery mechanism in case of errors, and will not allow any recovery procedures to be run.
515Requirements for atomic actions
For the implementation of atomic actions in a real-time system, a well-defined set of requirements is needed.
517The atomic actions should allow recovery procedures to be implemented, as their intention is damage confinement.
519Recoverable atomic actions
521Backward error recovery
The concept of backward error recovery is that, when it is applied to a group of communicating tasks, the tasks can be rolled back to their starting point of execution. In general this is problematic, because there is no easy way of defining a consistent set of recovery points. However, when an error occurs in an atomic action, the tasks involved can be rolled back to the clearly defined starting point of the action, and an alternative procedure can be executed. As atomic actions operate independently of other tasks, this guarantees correct results. Atomic actions used in this way are called conversations. Each action statement contains a recovery module, as seen here:
523action A with (P1,P2) do
524ensure <acceptance test>
525by
526- primary module
527else by
528- recovery module
529else error
530end A;
531The basic principles of a conversation can be summarized:
532 • On entry to the conversation, the state of a task is saved. The set of entry points form the recovery line.
533 • While inside the conversation, a task is allowed only to communicate with other tasks active in the conversation and general resource managers. As conversations are built from atomic actions, this property is inherited
534 • In order to leave the conversation, all tasks active in the conversation must have passed their acceptance test. If this is the case, then the conversation is finished and all recovery points are discarded.
 • If any task fails its acceptance test, all tasks roll back and execute their alternative modules. It is, therefore, assumed that any error recovery to be performed inside a conversation must be performed by all tasks taking part in the conversation.
536 • Nesting is allowed, but only strict nesting is allowed (Inner structure entirely contained in outer).
537 • If all alternatives in the conversation fail, then recovery must be performed at a higher level
Conversations have been criticized even though they offer recovery for sets of tasks. If a set of tasks fails, the same set of tasks simply tries another procedure; often, however, recovery really belongs at a higher level, where the tasks may need to cooperate with entirely different tasks to recover successfully. This can be inefficient, and the system may not recover at all.
540Forward error recovery
The principle of forward error recovery is that if an exception occurs in one of the tasks active in an atomic action, then that exception is raised in all the active tasks. Such an exception is called asynchronous, as it originates in another task. The following example illustrates the concept:
542action A with (P2 , P3 ) do
543 - the action
544exception:
545 when exception_a =>
546 - sequence of statements
547 when exception_b =>
548 - sequence of statements
549 when others =>
550 raise atomic_action_failure
551end A;
552There are two models of exception handling: termination and resumption. With the termination method, the atomic action will complete normally if all exceptions are handled and no further exceptions are raised. With the resumption method, the tasks will resume where they were when the exception was raised. If one task doesn't have an exception handler or fails to handle an exception, the atomic action will fail and a new exception will be raised in the other tasks.
554Asynchronous notification
555In real implementations of recoverable atomic actions, forward and backward error handling are usually combined, and backward error handling can even be implemented with forward error handling. The key to implementing error handling is the asynchronous messages, and most programming languages have some feature allowing asynchronous messages to be sent. There are mainly two principles for this: Termination and resumption.
556The resumption model (often called event handling) is like a software interrupt. A task contains an event handler which will handle predefined events. When the task is interrupted, the event handler is executed, and when it's finished, the task will continue execution from the point where it was interrupted.
With the termination model, each task specifies a domain in which it will accept an asynchronous signal. If the task is inside the domain when it receives an asynchronous signal, the domain is terminated; this is called asynchronous transfer of control (ATC). If the task is outside the domain, the ATC is queued or ignored. When the ATC has finished, control continues in the interrupted task at a location different from the point where it was interrupted.
559The user need for asynchronous notification
560The main feature of the asynchronous notification is the instantaneous message it provides. It adds complexity to the system, but the alternative to this is to wait for events to occur and processes to finish. In real-time systems, this isn't always good enough. The main reasons for using asynchronous notifications are
562Resource control
Implementation of resource entities requires some form of control agent. If the resource is passive, it is said to be protected (synchronized). If the resource is active, it is called a server. This chapter discusses the problem of reliable resource control.
565Resource control and atomic actions
 • Tasks need to communicate and synchronize to perform resource allocation. This need not be in the form of an atomic action, because the only information exchanged is that necessary to achieve the allocation.
567 • The code necessary for a particular task to communicate with the controller should be an atomic action.
569Resource management
570Resources must be encapsulated and be accessed only through a high-level procedural interface.
 • Packages in Ada
 • Naturally achieved when using monitors in Java and C.
574Expressive power and ease of use
575 • Monitors/synchronized methods, servers, protected resources.
 • Bloom's criteria for evaluating synchronization primitives in the context of resource management:
577 • Ease of use of a synchronization primitive encompasses:
578 ◦ the ease with which it expresses each of these synchronization constraints.
579 ◦ the ease with which it allows the constraints to be combined to achieve more complex synchronization schemes.
580 • In the context of resource control, the information needed to express these constraints can be categorized as follows:
581 ◦ the type of service request
582 ◦ the order in which requests arrive
583 ◦ the state of the server and any objects it manages
584 ◦ the parameters of requests
585 • Comparison between condition synchronization and avoidance synchronization:
 ◦ condition wait: all requests are accepted, but any task whose request cannot currently be met is suspended on a queue.
587 ◦ avoidance: requests are not accepted unless they can be met.
589Request type
 • Prioritizing one request over another.
591 • Monitor and synchronized methods: Programmed as distinct procedures.
592 • Ada: Represented by different entries in the server task, or protected object.
594Request order
595 • Monitor: FIFO
 • Ada: calls to the same entry can be serviced FIFO if the appropriate queuing policy is chosen.
 • Ada: calls to different entries are serviced in an arbitrary manner (out of the programmer's control). Arrival-order service is possible if all clients first call a common "register" entry.
599Server state
600 • Some operations may be permissible only when the server and the objects it administers are in a particular state.
 • Monitors can be used, with condition variables implementing the constraints.
603Request parameters
The order of operations of a server may be constrained by information contained in the parameters of a request, typically the size of the request. How can a resource allocator in Ada be constructed to handle this?
605 • Different entries for different sizes.
 • Two-stage interaction: first call "sign in" and then "allocate" on the server.
 ◦ A problem arises if the client calls "sign in" and is then aborted by another task before calling "allocate": the server deadlocks if it waits indefinitely.
 ◦ If the server does not wait indefinitely for "allocate", there is instead a problem if the client is merely slow.
609 ◦ solutions:
610 ▪ Define the abort primitive to apply to an atomic action rather than a task; forward and backward error recovery can then be used when communicating with the server.
 ▪ Assume that abort is only used in extreme situations where the breaking of the atomic action is of no consequence.
612 ▪ Try to protect the server from the effect of client abort.
614Requester priority
Client priority. It is possible in Java, C and Ada to define a priority queue, but in general concurrent programming languages release tasks from primitives such as semaphores or condition variables in either an arbitrary or a FIFO manner.
617The requeue facility
One means of enhancing the usability of avoidance synchronization is to add a requeue facility. The Ada language has such a facility.
The key notion behind requeue is to move a task (which has already passed one guard or barrier) to 'beyond' another guard.
620Ada allows requeues between task entries and protected object entries.
622Semantics of requeue
623I gave up the sentence: read chapter 8.4.1
625Requeuing to other entries
626Example in the book ch 8.4.2
628Asymmetric naming and security
629 • asymmetric naming → the server being unaware of the identity of the client.
 • A server task may wish to know the identity of the calling client so that:
 ◦ a request can be refused on the grounds of deadlock prevention or fairness (that is, granting the request would be unfair to other clients);
 ◦ it can be guaranteed that resources are released only by the task that earlier obtained them.
631 • Ada: get the client id from task identification
 • Java: get the client id using Thread.currentThread()
634Resource usage
635 • Shared access: more than one can access at the same time.
636 • Exclusive access: only one can access at the same time.
 • A combination of the above; a problem arises when a task requests shared access while another task already holds exclusive access, and vice versa.
 • A task that allocates resources should release them as soon as possible.
 • Releasing them too early → possible errors.
 • A modified two-phase resource-usage policy can be used, so that resources are not released until the action is completed.
 • Recovery procedure: on success → release the resources; on failure → undo any effects on the resources.
643Deadlock
644 • Many tasks competing for a finite number of resources.
 • Livelock: tasks remain busy (for example busy-waiting) without making progress.
 • Several tasks continually attempting to gain sole access to the same resource; if the resource allocation policy is not fair, one task may never get its turn to access the resource. This is called indefinite postponement, starvation or lockout.
 • Liveness → in concurrent systems: if something is supposed to happen, it eventually will.
648 • four necessary conditions that must hold if deadlock is to occur:
 ◦ Mutual exclusion - only one task can use a resource at once (that is, the resource is non-sharable, or at least limited in its concurrent access).
 ◦ Hold and wait - there must exist tasks which are holding resources while waiting for others.
 ◦ No pre-emption - a resource can only be released voluntarily by a task.
 ◦ Circular wait - a circular chain of tasks must exist such that each task holds resources which are being requested by the next task in the chain.
 • Three approaches to handling deadlock:
 ◦ deadlock prevention
 ◦ deadlock avoidance
 ◦ deadlock detection and recovery
652Scheduling real-time systems
Scheduling: restricting the non-determinism in concurrent systems by ordering the use of the resources of the system, and predicting the worst-case behaviour under this ordering. That is, a scheduler has to assign priorities to the different processes and run a schedulability test. In a static scheduler this is decided/calculated prior to execution, while a dynamic scheduler calculates it at run-time.
655The cyclic executive approach
656For this simple approach, all tasks are purely periodic and known in advance. Tasks are either running or not initiated, never suspended. It is only a sequence of procedure calls, thus there is no need for semaphores etc. The major cycle determines the longest possible period of a task, while the minor cycle determines the shortest. The scheduler groups the tasks in minor cycles, making sure all tasks run at least one time (the task with the longest period) every major cycle. Everything is planned in advance, however placing the tasks in the cycles is similar to the knapsack problem which is NP-hard.
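A minimal sketch of a cyclic executive in C, assuming three illustrative tasks with 25, 50 and 100 ms periods and a hypothetical wait_for_minor_cycle() primitive that blocks until the next 25 ms tick:

extern void task_a(void);               /* period 25 ms  */
extern void task_b(void);               /* period 50 ms  */
extern void task_c(void);               /* period 100 ms */
extern void wait_for_minor_cycle(void); /* hypothetical clock-driven delay */

void cyclic_executive(void)
{
    for (;;) {                          /* one iteration = one 100 ms major cycle */
        task_a(); task_b(); task_c();   /* minor cycle 1 */
        wait_for_minor_cycle();
        task_a();                       /* minor cycle 2 */
        wait_for_minor_cycle();
        task_a(); task_b();             /* minor cycle 3 */
        wait_for_minor_cycle();
        task_a();                       /* minor cycle 4 */
        wait_for_minor_cycle();
    }
}

Note that this is just a sequence of procedure calls, which is why no semaphores or other synchronization primitives are needed.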
658Task-based scheduling
659A task can now be either running, suspended waiting for a timing event, or suspended waiting for a non-timing event.
661Scheduling approaches
663Scheduling characteristics
A schedulability test can be sufficient (but not necessary), necessary (but not sufficient), or exact (both sufficient and necessary).
Preemption and non-preemption
668Simple task model
669The book makes some assumptions on the tasks: fixed set of periodic, independent tasks with known periods, constant worst-case execution times, and deadlines equal to their periods. They run on a single processor where overheads run in zero time. A critical instant is the instant when the processor runs with maximum load. See table 11.2 page 370 for standard notation used in the book.
Fixed-priority scheduling (FPS)
673Utilization-based schedulability tests for FPS
If the following holds (with rate-monotonic priority ordering), then all N tasks will meet their deadlines:

\sum_{i=1}^{N} \frac{C_i}{T_i} \le N\left(2^{1/N} - 1\right)

i.e. the total utilization of the task set must not exceed the bound N(2^{1/N} - 1). The test is sufficient but not necessary: a system may fail the test and still be schedulable.
678If we increase the number of tasks to infinity with no restriction on relative period, we can prove that the maximum utilization approaches 69%. This is called asymptotic analysis of utilization.
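A small C sketch of applying this sufficient test to an assumed example task set (the C and T values are illustrative, not taken from the book):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double C[] = {10, 8, 5};      /* worst-case execution times      */
    double T[] = {50, 40, 30};    /* periods (deadlines equal to T)  */
    int N = 3;

    double U = 0.0;
    for (int i = 0; i < N; i++)
        U += C[i] / T[i];                       /* total utilization */

    double bound = N * (pow(2.0, 1.0 / N) - 1.0);   /* RM bound      */
    printf("U = %.3f, bound = %.3f -> %s\n", U, bound,
           U <= bound ? "schedulable" : "test inconclusive");
    return 0;
}

The "inconclusive" branch reflects that the test is only sufficient: failing it does not prove unschedulability.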
680Improved utilization-based tests for FPS
681The schedulability test from previous section can be improved by reinterpretation of the given equation. Instead of having N stand for number of tasks, it is now defined as number of distinct task families. A task family is tasks that have periods that are multiples of a common value. Ex. 3, 9, 12, here N=1. If the given periods are 3, 6, 10, then N=2 as 10 does not share a common multiple with the other periods. (Read BW ch. 11.4.1)
Another improved utilization-based test has the form:

\prod_{i=1}^{N}\left(\frac{C_i}{T_i} + 1\right) \le 2
685Response time analysis (RTA) for FPS
As opposed to the utilization-based FPS tests, the RTA test is exact (sufficient and necessary).

w_i^{n+1} = C_i + \sum_{j \in hp(i)} \left\lceil \frac{w_i^n}{T_j} \right\rceil C_j

Here hp(i) is the set of tasks with a higher priority than task i, and the limit of w_i is the response time of task i. In other words: the response time of task i equals the execution time of i plus the sum, over all tasks j of higher priority than i, of the number of times j is released while i is pending multiplied by the execution time of j. To find w_i numerically, we iterate the recurrence until w_i^{n+1} = w_i^n, starting with w_i^0 = C_i. The process is repeated for all tasks in the system.
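A small C sketch of the RTA iteration for an assumed example task set (tasks are listed in decreasing priority order; the values are illustrative):

#include <math.h>
#include <stdio.h>

#define N 3

int main(void)
{
    double C[N] = {5, 10, 12};   /* worst-case execution times, highest priority first */
    double T[N] = {20, 50, 80};  /* periods (deadlines assumed equal to the periods)    */

    for (int i = 0; i < N; i++) {
        double w = C[i], prev = -1.0;
        while (w != prev && w <= T[i]) {      /* iterate to a fixed point or a deadline miss */
            prev = w;
            w = C[i];
            for (int j = 0; j < i; j++)       /* interference from higher-priority tasks     */
                w += ceil(prev / T[j]) * C[j];
        }
        printf("task %d: R = %.0f -> %s\n", i, w,
               w <= T[i] ? "meets deadline" : "misses deadline");
    }
    return 0;
}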
690Sporadic and aperiodic tasks
Up until this point we have assumed D = T, which is unreasonable for sporadic/aperiodic tasks. The response-time recurrence is still valid for D < T, as long as we stop iterating when w_i^{n+1} > D_i.
693Hard and soft tasks
The worst-case figure is usually considerably higher than the average case, so schedulability testing with worst-case figures can cause low processor utilization. To improve this, it should be possible to schedule all tasks using average figures, i.e. we allow transient overload: situations where we might not meet all deadlines. However, hard real-time tasks should never miss deadlines, so they must be schedulable using worst-case figures.
696Aperiodic tasks and fixed-priority execution-time servers
698Task systems with D < T
Deadline monotonic priority ordering (DMPO): tasks have a fixed priority that is inversely related to their deadline (the shorter the deadline, the higher the priority). It is optimal for a fixed-priority scheme. FPS works well with D < T, while the simple EDF utilization test no longer applies. However, since EDF is more effective than FPS, a system that passes an FPS test will also be schedulable under EDF.
701Task interactions and blocking
So far, tasks have been viewed as independent. In reality they are not, as they often share resources. Thus we have to add the possibility for tasks to be suspended while waiting for a future event (e.g. an unlocked semaphore).
To make the response-time recurrence work with the expanded model, we add a term B_i, the maximum blocking time of task i:

w_i^{n+1} = C_i + B_i + \sum_{j \in hp(i)} \left\lceil \frac{w_i^n}{T_j} \right\rceil C_j

B_i = \sum_{k=1}^{K} \mathrm{usage}(k, i)\, C(k)

See the book, page 385, for an explanation of the usage function used to calculate B_i.
708Priority ceiling protocols
709Priority ceiling protocols are used to manage and synchronize shared resources so that deadlocks and priority inversions are avoided and blocking is minimized.
Original ceiling priority protocol (OCPP):
712 • All tasks have a static default priority
713 • Resources have a static ceiling value, that corresponds to the max priority of the tasks using the resource
714 • Tasks have a dynamic priority; its own static priority or any (higher) priorities inherited from blocking higher-priority tasks
 • A task may only lock a resource if its dynamic priority is higher than the ceiling of any currently locked resource (excluding any resource it has locked itself)
716We see that the priority of a task will be increased if it is blocking a higher priority task.
Immediate ceiling priority protocol (ICPP)
719 • All tasks have a static default priority
720 • Resources have a static ceiling value, that corresponds to the max priority of the tasks using the resource
 • Tasks have a dynamic priority; the max of its own static priority and the ceiling values of any resources it has locked
Instead of waiting until it actually blocks a higher-priority task, ICPP increases the priority of a task as soon as it locks a resource. Under ICPP a task will not start executing until all the resources it needs are free, since an unavailable resource would mean that a task of equal or higher priority is running. ICPP is easier to implement and leads to fewer context switches, but causes more priority changes than OCPP.
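POSIX offers an immediate-ceiling-style locking policy through PTHREAD_PRIO_PROTECT; a minimal sketch (the ceiling value 30 and the scheduling setup around it are illustrative assumptions, and the ceiling must be at least the priority of every task that uses the resource):

#include <pthread.h>

pthread_mutex_t resource;

void init_ceiling_mutex(void)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    /* The locking task immediately runs at the mutex's ceiling priority. */
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_PROTECT);
    pthread_mutexattr_setprioceiling(&attr, 30);
    pthread_mutex_init(&resource, &attr);
    pthread_mutexattr_destroy(&attr);
}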
724Ceiling protocols, mutual exclusion and deadlock
726An extendible task model for FPS
So far we have extended the FPS model to include D < T, sporadic/aperiodic tasks and task interaction. This chapter seeks to expand the model by including:
729Priority assignment
730If a task p is assigned to the lowest priority and is feasible, then if a feasible priority ordering exists for the complete task set, an ordering exists with task p assigned the lowest priority
732Insufficient priorities
If we have insufficient priority levels, tasks must share priorities, and we must then assume interference between tasks that share a priority.
735Execution-time servers
736A virtual resource layer between the applications and the processor(s). Guarantees level of service and resource allocation. Both periodic and sporadic servers act as periodic tasks.
738Earliest deadline first (EDF) scheduling
An alternative to fixed-priority scheduling (FPS), but less well supported by programming languages and operating systems.
Utilization-based schedulability tests for EDF
The utilization-based test for EDF: a task set (with D = T) is schedulable if and only if

\sum_{i=1}^{N} \frac{C_i}{T_i} \le 1
744where Ti is the task period and Ci the worst-case execution time for the task.
Comparison between fixed-priority scheduling (FPS) and earliest deadline first (EDF).
746FPS:
 • FPS is easier to implement, as the scheduling attribute (priority) is static
 • It is easier to incorporate tasks without deadlines by giving them an arbitrary priority.
 • Priority as a scheduling attribute can be used to reflect factors other than deadlines.
 • During overload situations FPS behaves predictably, with the lower-priority tasks missing their deadlines first.
751EDF:
 • EDF uses deadlines as the scheduling attribute; these are dynamic and require a more complex run-time system.
 • Giving a task an arbitrary deadline is strictly not the same as a task without a deadline.
 • Deadlines do not represent the criticality of a task, i.e. the task with the earliest deadline may not be the most critical task.
755 • EDF is unpredictable under overload and may experience a domino effect where a large number of tasks miss their deadline.
756 • Higher utilization than FPS
758Processor demand criteria for EDF
Utilization-based schedulability tests cannot be applied to EDF when deadlines are shorter than periods. In that case schedulability can instead be checked with the Processor Demand Criteria (PDC) method. In PDC the load of the system h(t), i.e. the demand of all tasks with deadlines before t, is given by

h(t) = \sum_{i=1}^{N} \left\lfloor \frac{t + T_i - D_i}{T_i} \right\rfloor C_i

where, as before, T_i is the task period, D_i the task deadline and C_i the task computation time.
A system is schedulable if and only if the load h(t) never exceeds the time available to complete that load. More formally,

h(t) \le t \quad \text{for all } t > 0

However, one only needs to check the values of t corresponding to task deadlines. There is also an upper bound L on the values of t that have to be checked. An unschedulable system will have

h(t) > t \quad \text{for some value of } t < L
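A small C sketch of checking h(t) <= t at every absolute deadline up to an assumed bound L (the task parameters and L are illustrative):

#include <math.h>
#include <stdio.h>

#define N 2

int main(void)
{
    double C[N] = {2, 3};     /* worst-case execution times */
    double D[N] = {5, 9};     /* relative deadlines (D <= T) */
    double T[N] = {10, 12};   /* periods                     */
    double L = 60;            /* assumed upper bound on t    */

    /* Check h(d) <= d at every deadline d = D_i + k*T_i below L. */
    for (int i = 0; i < N; i++) {
        for (double d = D[i]; d < L; d += T[i]) {
            double h = 0.0;
            for (int j = 0; j < N; j++)
                h += floor((d + T[j] - D[j]) / T[j]) * C[j];
            if (h > d) {
                printf("unschedulable at t = %.0f (h = %.0f)\n", d, h);
                return 1;
            }
        }
    }
    printf("all deadlines up to L = %.0f satisfy h(t) <= t\n", L);
    return 0;
}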
767The QPA test
For non-trivial systems, i.e. systems with large task sets, the upper bound L may be large and the number of deadlines that need to be checked can become very large. The number of time-points that need to be checked can be reduced through Quick Processor-demand Analysis (QPA): by checking deadlines from L backwards towards 0, only a fraction of all the deadlines have to be checked.
Let h(L) = s. If s ≤ L, then h(t) < t for all t with s < t < L, so those deadlines need not be checked and the analysis can jump directly to t = s. If s > L, the system is unschedulable.
771Blocking and EDF
Priority inversion in FPS corresponds to deadline inversion in EDF, i.e. when a task requires a resource that is in use by another task with a longer deadline. Blocking problems in EDF can be prevented by using the Stack Resource Policy (SRP). SRP assigns a preemption level to each task, reflecting the length of its deadline: the shorter the deadline, the higher the preemption level.
At run-time, shared resources are given to the task with the highest preemption level and then passed on to the task with the second highest, and so on. Thus tasks are blocked at most once (at release) and deadlocks are prevented.
775Aperiodic tasks and EDF execution-time servers
776Fill in some text here!
778Dynamic systems and online analysis
In hard real-time systems arrival patterns and computation times are known a priori, but in dynamic soft real-time systems they may not be known. As a consequence, offline analysis cannot be performed on the system and some kind of online analysis is needed.
EDF is a dynamic scheduling scheme that performs badly under transient overloads: when a task misses its deadline it can start a domino effect, causing the next task to miss its deadline and so on. Therefore an online scheduling scheme is needed to manage any overload that is likely to occur due to the dynamics of the system's environment.
Most online scheduling schemes use the following two mechanisms to handle overload:
 1. An admission control procedure that limits the number of tasks that compete for the processors.
 2. An EDF dispatching routine for those tasks that are admitted by 1.
Admission control prevents the processor from being overloaded with too many tasks, so that EDF works effectively. Admission control will admit some tasks and reject others, so an overall priority has to be established. This is done by giving both admitted and rejected tasks values, classified as:
 • Best-effort
 • Hybrid systems
787Worst-Case execution time
In all scheduling approaches described in the book, i.e. cyclic executives, FPS and EDF, the worst-case execution time (WCET) is assumed to be known, but in most cases it has to be measured or found through analysis.
Measuring WCET is problematic, as one cannot be sure the worst case has actually been observed. The drawback of analysis is that an effective and accurate processor model has to be available.
Most analysis techniques use the following steps to estimate the WCET:
 1. Decompose the code into a directed graph of basic blocks, which represent straight-line code.
 2. Use the processor model to estimate the WCET of each basic block.
 3. When the WCET has been calculated for all basic blocks, the cost of moving from one block to another in the directed graph is the WCET of the destination block.
 4. Calculate the WCET by traversing the graph and eliminating all but the largest paths between blocks.
This, however, demands that the code is restricted, i.e. all loops and recursion have to be bounded.
794Multiprocessor scheduling
Scheduling problems for multi-core processor systems are significantly more complicated than for single-processor systems. Three issues have to be addressed:
 1. the placement of tasks on processors
 2. the scheduling of any shared network
 3. the implementation of locks in shared-memory multiprocessor architectures
798Global or partitioned placement
799Two task placement schemes are possible:
 • Global scheduling: a task may execute on any processor.
 ◦ Disadvantages: it is difficult to identify the optimal scheduling policy. For single processors EDF is optimal, but no optimal scheme is known for multiprocessor systems.
 • Partitioned scheduling: each task is allocated to a single processor.
 ◦ Advantages: does not require run-time support, and can handle systems that the global scheme would find difficult.
 ◦ Disadvantages: it is difficult to generate a valid allocation of tasks, since processors must not be overloaded; finding optimal allocations for large numbers of tasks and processors is not possible without heuristics.
814Scheduling the network
For bus-based, tightly coupled multiprocessors, the behaviour of multi-level caches makes worst-case execution time (WCET) even harder to analyse. With network communication, the messages must be scheduled as well to obtain reliable end-to-end timing. There are, however, several network schemes more suitable for timing analysis, e.g. TDMA (time division multiple access).
TDMA is only suited to static task allocation, as no two processors should ever produce messages at the same time.
818Mutual exclusion on multiprocessor platforms
Implementing mutual exclusion in multiprocessor shared-memory systems requires locks that can be contended for globally. When a lock is requested by a task on another processor but not granted, the requesting task will busy-wait (spin) on the lock, waiting for it to become free.
A system with many globally shared objects and nested usage patterns (i.e. tasks accessing one object while holding the lock on other objects) will be prone to deadlocks and livelocks when using global locks.
Therefore the use of lock-free algorithms is a better alternative. Creating multiple copies, where many are read from but only one is updated, is one way of implementing a lock-free algorithm.
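A minimal C11 sketch of the multiple-copy idea (the state type is illustrative, and reclamation of old copies is deliberately left out): readers follow an atomic pointer to the current copy, while a writer prepares a new copy and publishes it with a single atomic store.

#include <stdatomic.h>
#include <stdlib.h>

struct state { int value; };

static _Atomic(struct state *) current;   /* points to the copy readers use */

void init_state(int v)
{
    struct state *s = malloc(sizeof *s);
    s->value = v;
    atomic_store(&current, s);            /* publish the first copy */
}

int read_value(void)
{
    struct state *s = atomic_load(&current);   /* readers never block */
    return s->value;
}

void write_value(int v)
{
    struct state *fresh = malloc(sizeof *fresh);   /* new copy, not yet visible */
    fresh->value = v;
    atomic_store(&current, fresh);                 /* one atomic update publishes it */
    /* the old copy must eventually be reclaimed once no reader still uses it */
}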
823Scheduling for power-aware systems
Power-aware systems are often systems that run on batteries, and they often use variable-speed processors to save power. If the frequency of the processor is lowered, the voltage required for stable operation is also lowered considerably. Because dynamic power consumption scales roughly quadratically with the voltage used, running at a frequency that requires half the voltage can effectively quadruple battery life. This, however, raises a different scheduling problem: at what speed must the processor run when executing a set of tasks for the system (set of tasks) to be schedulable?
In order to answer the question above, we must first consider the following questions:
 1. Is the system schedulable when the processor is running at maximum speed?
 2. If the system is schedulable, by what maximum factor k can all the computation times (Ci) be increased such that the system still remains schedulable?
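As a small illustration (this assumes the simple D = T task model and is not an example from the book): scaling every C_i by k scales the total utilization U to kU, so under EDF the answer to question 2 follows directly from the utilization test,

k = 1 / U

With the rate-monotonic utilization test, k = N(2^{1/N} - 1) / U is guaranteed to keep the task set schedulable; the true maximum may be larger, since that test is only sufficient.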
829Incorporating system overheads
So far, the overheads of implementing a multitasking scheduling system have been ignored. In real systems, however, the following have to be taken into consideration:
 1. The cost of a context switch (switching the processor from one task to another) is not negligible.
832 2. All context switch operations are non-preemptive
833 3. The cost of handling an interrupt and releasing an application sporadic task is not insignificant.
834 4. A clock interrupt could result in periodic tasks being moved from a delay queue to dispatch/ready queue.
836Transactions Fundamentals
ACID properties: properties of an atomic transaction.
838 • Atomicity: The transaction completes successfully (commits) or if it fails (aborts) all of its effects are undone (rolled back).
839 • Consistency: Transactions produce consistent results and preserve the internal consistency of the data it acts on.
840 • Isolation: Intermediate states produced while a transaction is executing are not visible to others. Furthermore, transactions appear to execute serially, even if they are actually executed concurrently.
841 • Durability: The effects of a committed transaction are never lost.
843Atomicity
845Consistency
846A transactional application should maintain the consistency of the resources that it uses. Unlike the other ACID properties, this is something the transaction cannot achieve by itself since it does not possess any semantic information about the resources it manipulates and it is therefore the programmer's responsibility to ensure consistency.
848Isolation (Serializability)
849Concurrency Control Mechanisms: To ensure serializability.
 • Two-Phase Concurrency Control: All operations on an object are of type "read" or "write". Many computations can hold a "read-lock", which is associated with an object by a computation before reading it. However, if someone wants to change (insert/modify/delete) the object, this requires a "write-lock". This cannot be obtained if any other computation holds either a read or a write lock on that object. This is to make sure that no one changes something others are using in their operations. If you have obtained a write-lock, this will block all other requests for read and write until you release the lock. All computations must follow a two-phase locking policy. In phase 1, the growing phase, locks are acquired but not released. In phase 2, the shrinking phase, locks are released but cannot be acquired. To avoid the cascading-rollback problem, all locks are released simultaneously in the shrinking phase.
 • Pessimistic Concurrency Control: When a resource is accessed, a lock is obtained on it. The lock will remain held on that resource for the duration of the transaction. Drawbacks with this are:
852 ◦ Deadlocks
853 ◦ Concurrency may be limited by the granularity of the locks
854 ◦ The overhead of acquiring and maintaining concurrency control information in an environment where conflict or data sharing is not high.
 • Optimistic Concurrency Control: Locks are only acquired at the end of the transaction, when it is about to terminate. When updating a resource, the operation needs to check whether this conflicts with updates that may have occurred in the interim, using timestamps. Most transaction systems that offer optimistic concurrency control will roll back if a conflict is detected, which may result in a lot of work being lost.
856 • Type-Specific Concurrency Control: Concurrent read/write or write/write operations are permitted on an object from different transactions provided these operations can be shown to be non-interfering. Object-oriented systems are well suited to this approach, since semantic knowledge about the operations of objects can be exploited to control permissible concurrency within objects.
 • Deadlock Detection and Prevention: If a deadlock is detected, the only way to resolve it is for at least one of the transactions to roll back. Two popular techniques for deadlock detection are:
858 ◦ Timeout-based: A transaction waits a specified period of time, then rolls back assuming there is a deadlock.
859 ◦ Graph-based: Tracks waiting transaction dependencies by constructing a waits-for graph: nodes are waiting transactions and edges are waiting situations. Guaranteed to detect all deadlocks, but may be costly to execute in a distributed environment.
861Durability
Any state changes that occur during the transaction must be saved in such a manner that a subsequent failure will not cause them to be lost; for example, a database will typically maintain its state on disk to survive machine failures.
866Two-Phase Commit Optimizations
There are several variants of the standard two-phase commit protocol (used to achieve atomic outcomes for transactions).
869Synchronization
Synchronizations are informed that a transaction is about to commit, when it has completed, and in which state it completed.
Synchronizations essentially turn the two-phase commit protocol into a four-phase protocol. Before the transaction starts the two-phase commit, all registered Synchronizations are informed; what they then do is implementation-dependent. A typical implementation will flush a cached copy of some transactional object state to a database, where its ultimate fate will be controlled by a participant in the two-phase commit protocol. Any failure at this stage will cause the transaction to roll back. The coordinator then conducts the normal two-phase commit protocol (this does not involve the Synchronizations). When the transaction coordinator has finished the two-phase commit protocol, the Synchronizations are informed again.
873Heuristic Transactions
Heuristic outcomes are one of the most significant types of failure that can occur in a transaction system. They can give you, or the system, a lot of work to resolve, and they are unavoidable.
The two-phase commit protocol is blocking in order to guarantee atomicity. If a failure occurs, participants may remain blocked even if failure recovery mechanisms exist. Some applications and participants cannot tolerate this blocking (for the same reasons as deadlock). To break the blocking, participants that are past the prepare phase are allowed to make unilateral decisions as to whether to commit or roll back. The choice will depend on the participant implementation and the application/environment. The participant must record this decision in case it later has to complete the original transaction. If the coordinator's eventual outcome differs from the participant's decision, a heuristic outcome has occurred, with a corresponding heuristic decision.
876Two levels of heuristic interaction:
877Possible heuristic outcome(the worst type of failure):
Heuristic decisions should be used with care and only in exceptional circumstances, since there is a possibility that the decision will differ from the outcome eventually determined by the coordinator, which can lead to loss of integrity in the system.
880The Transaction Log
To guarantee atomicity in the presence of failure, it is necessary for the transaction service itself to maintain state. The transaction log records the registered participants and where they have reached in the protocol. Once the coordinator's decision has been carried out and the coordinator informed, the participant's log entry can be deleted.
883Failure Recovery
Failures can occur in both centralized and distributed systems, and more components lead to a greater chance of failure. In a distributed system, failures are often independent.
Transactions are a good tool for ensuring consistency in the presence of concurrent users, and they are also an excellent fault-tolerance mechanism: when using a transaction you get the benefits of atomicity despite failures (all or nothing happens). Failures can, however, also affect the transaction system itself. Without failure recovery and fault-tolerance capabilities, the benefits a transaction system offers are seriously degraded.
886In order to cope with failure, transaction service implementations must possess a failure recovery subsystem. The subsystem ensures that results of a transaction are applied consistently to all resources affected by the transaction. The original application does not need to restart in order to perform the recovery.
Recovery after failure requires the information from the transaction log, which is held in some durable state-store. The information is used to recreate the transaction, and the recovery subsystem then continues to complete it.
If the participants simply waited for recovery to be driven from the coordinator, there would be some issues; for example, if the coordinator has also failed and recovered, it can take some time before its recovery subsystem gets around to recovering the specific transaction.
889Most transaction systems require a failure recovery component in both the coordinator and participant machines. Implementing recovery is extremely complex. Most transaction service implementations do not provide it.
891Trust Implications
892The trust between the parties involved in the transaction is often overlooked. In any transaction there are three actors:
893 • The participant that is driven through the two-phase protocol.
894 • The coordinator, with which participants register and that drives them through the commit protocol.
895 • The application logic that tells the coordinator whether it wants to commit or rollback the transaction.
Whatever the application logic might want is never guaranteed by the coordinator, because failures can occur that force the transaction to roll back. If the coordinator and participants behave correctly and no failures occur, then whatever the application logic tells the coordinator to do, the participants will follow. The coordinator can only do what it is told to by its users.
898Types of Transactions
The most common form of transaction is the top-level transaction. Top-level transactions are most suitably viewed as "short-lived" entities, performing stable state changes to the system. They are less suited to structuring "long-lived" application functions, which may reduce the concurrency in the system to an unacceptable level.