Dealing with errors is a vital activity. Andy Longshaw and Eoin Woods conclude their pattern language.
This is the second part of a paper that presents patterns for handling error conditions in distributed systems. The patterns in the collection are illustrated in Figure 1.
Figure 1
Some of the patterns (Split Domain and Technical Errors, Log at Distribution Boundary, Unique Error Identifier) were discussed in the first part of the paper in the previous issue. The remaining patterns are covered in this second part. At the end of the paper, a set of proto-patterns is briefly described. These are considered to be important concepts that may or may not become fully fledged patterns as the paper evolves.
Big Outer Try Block
Problem
Unexpected errors can occur in any system, no matter how well it is tested. Such truly exceptional conditions are rarely anticipated in the design of the system and so are unlikely to be dealt with by the system's error handling strategy. This means that these errors will propagate right to the edge of the system and will appear to 'crash' the application if not handled at that point. Some or all of the information associated with such unexpected errors may then be lost, making it difficult to rectify the underlying problem in the system.
Context
A distributed system with a largely 'lay' user community, probably using graphical user interfaces. The interface is likely to be very simple: possibly even a 'kiosk style' interface. Users are mostly on remote sites and will not do much to report errors if they can work around them.
Forces
- If an in-depth error report, particularly for a technical error, is presented to an end user, they are unlikely to be able to report its content in enough detail for the underlying problem to be unambiguously identified, and so the details of the error are likely to be lost.
- If technical errors are presented to users on a regular basis, they will start to ignore them rather than to go through the process of trying to report them and knowledge of the existence, as well as the details, of these errors will be lost entirely.
- Members of the support staff need to be able to associate user problem reports with logged error information, but detailed error information can be very big and it all looks the same to an end user, making it difficult to report.
- We want to avoid having to write code to handle technical errors at multiple layers of the application but this opens up the risk that such errors will 'leak' through to the user.
Solution
Implement a Big Outer Try Block at the 'edge' of the system to catch and handle errors that cannot be handled by other tiers of the system. The error handling in the block can report errors in a consistent way at a level of detail appropriate to the user base.
Implementation
In the system's ultimate client, wrap the top-level invocation of the system in a Big Outer Try Block that will catch any error - domain or technical - propagating up from the rest of the system. The Big Outer Try Block should differentiate between technical errors (such as databases not being available) and domain errors (such as performing business process steps in the wrong order) as suggested in Split Domain and Technical Errors .
Technical errors should be logged for possible use by technical support staff. The user should then be informed, in general terms, that something terrible has happened, making it clear that what has happened is not related to their use of the system.
A domain error that reaches the Big Outer Try Block is probably a failure in the design of the user interface that resulted in an unanticipated business process state being reached and as such should be treated as a system fault. In such cases, again the error should be logged and a user-friendly message displayed, but in this case the message can include details of the problem encountered, as these details are likely to be meaningful to the user since they relate to the business process that they were performing.
Finally, a totally unpredictable error (such as an exception indicating a resource shortage due to having run out of memory) that reaches the Big Outer Try Block is some form of internal or environmental error that could not be handled at a lower level. As with a technical error, a generic error should be displayed to the user and the details of the error logged locally.
An example of the structure of a Big Outer Try Block 's implementation is shown in Listing 1.
    public class ApplicationMain {
        ...
        public static void main(String[] args) {
            try {
                ApplicationMain m = new ApplicationMain();
                m.initialize();
                m.execute();
                m.terminate();
            } catch (AppDomainException de) {
                // Domain exceptions shouldn't get to this
                // level as they should be handled in the
                // user interface. If they get here, report
                // the text to the user and log them in a
                // local log file
            } catch (AppTechnicalException te) {
                // Technical exceptions here are probably
                // user interface problems. Display a
                // generic apology and log to a local log file
            } catch (Throwable t) {
                // Other exception objects must be internal
                // errors that could not be caught and
                // handled elsewhere. Display a generic
                // apology and log to a local log file
            }
        }
    }
Listing 1
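As an illustration only, the catch blocks of Listing 1 could be filled in along the following lines. This is a minimal sketch under several assumptions: the two exception classes are hypothetical stand-ins for the AppDomainException and AppTechnicalException named in the listing (here unchecked), logging uses java.util.logging, and the class name OuterTryBlock, the run method and the apology text are invented for the example. The method returns the message to display rather than displaying it, so that the sketch stays independent of any particular user interface toolkit.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-ins for the exception types named in Listing 1.
class AppDomainException extends RuntimeException {
    AppDomainException(String message) { super(message); }
}

class AppTechnicalException extends RuntimeException {
    AppTechnicalException(String message, Throwable cause) { super(message, cause); }
}

final class OuterTryBlock {
    private static final Logger LOG = Logger.getLogger(OuterTryBlock.class.getName());

    // Assumed wording; a real system would use its own message.
    static final String GENERIC_APOLOGY =
        "Sorry, an internal problem occurred. It was not caused by anything you did.";

    /** Runs the application; returns the text to show the user, or null on success. */
    static String run(Runnable application) {
        try {
            application.run();
            return null;
        } catch (AppDomainException de) {
            // Treated as a system fault, but the text is meaningful to the user.
            LOG.log(Level.SEVERE, "Unhandled domain error", de);
            return de.getMessage();
        } catch (AppTechnicalException te) {
            // Full detail goes to the log; the user sees a generic apology.
            LOG.log(Level.SEVERE, "Unhandled technical error", te);
            return GENERIC_APOLOGY;
        } catch (Throwable t) {
            // Anything else is an internal or environmental error.
            LOG.log(Level.SEVERE, "Unexpected internal error", t);
            return GENERIC_APOLOGY;
        }
    }
}
```

A client's main method would pass its top-level invocation to run and display whatever message comes back.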
Positive consequences
- Error information is never lost because of unexpected errors propagating to the edge of the system and 'leaking out'. The error information is always captured in its entirety to allow it to be retrieved for support and diagnostic purposes.
- Users are never surprised by an application simply stopping or crashing, but are always informed that something has gone wrong in a user comprehensible form.
- If the application does fail in an unexpected way, it always handles this condition in a consistent manner.
- Other parts of the system may have simpler error handling as they do not need to include handling for totally unpredictable errors.
Negative consequences
- The outer catch block needs to be carefully implemented so that exception information from the very wide range of possible exception types that it must handle does not get lost when handling that scenario.
Related patterns
- Implementing Split Domain and Technical Errors makes the implementation of Big Outer Try Block simpler because the different types of error can be easily differentiated and handled differently.
- This pattern can be combined with the Hide Technical Error Detail From Users pattern in order to ensure that suitable messages are reported to the user when the Big Outer Try Block is triggered.
- Big Outer Try Block combines with Log at Distribution Boundary so that the errors that it receives are more relevant and potentially suitable for display to the user.
- This pattern can be combined with Unique Error Identifier in order to ensure that errors logged by the Big Outer Try Block can be clearly identified.
- A Big Outer Try Block is a form of Default Error Handling [Renzel97]
- This concept is also mirrored in the Java idiom Safety Net in [Haase]
Hide Technical Error Detail From Users
Problem
The technical details of errors that occur are typically of no interest to the end-users of a system. If exposed to such users, this error information may cause unnecessary concern and support overhead.
Context
An application with a largely non-technical user community, probably using the system via some sort of graphical interface.
Forces
- If a detailed error report, particularly for a technical error, is presented to an end user, they are likely to find its content incomprehensible.
- If technical errors are presented to end users or the application simply stops or crashes unexpectedly then this is likely to cause a loss of confidence in the application, possibly leading to a reluctance to use it.
- Inconsistent user error reporting makes the system difficult to support as it confuses the users and prevents them reporting problems accurately and consistently.
- Technical errors generally have a lot of information that is useful for support staff but it is irrelevant to the end user.
- If the system under consideration offers a limited capability user interface (such as that offered by a mobile device), the interface may not be capable of reporting detailed error information in a comprehensible manner.
Solution
Implement a standard mechanism for reporting unexpected technical errors to end-users. The mechanism can report all errors in a consistent way at a level of detail appropriate to the different user constituencies who need to be informed about the error.
Known uses
The authors are aware of a number of instances of this pattern in enterprise systems, although none of them are available for public study. Some examples of using this pattern outside the domain of enterprise systems include the following.
- A number of self service web-sites report a generic error message if an internal error occurs, including a unique error identifier that can be used to report the situation to a helpdesk.
- Some intelligent hardware devices respond to errors that occur by displaying a simple error screen (in some cases including a unique error identifier to allow the error to be uniquely identified by the hardware supplier), that instructs the owner to call a telephone hotline in order to obtain assistance.
- The Microsoft Windows error dialog that is displayed when an application encounters an internal error is an example of the use of this pattern.
Implementation
Within the system's user interface implementation, provide a single, straightforward mechanism for reporting technical errors to end-users. The mechanism is almost certainly going to be a simple API call of the general form:
void notifyTechnicalError(Throwable t) ;
The mechanism created should perform two key tasks:
- Log the full technical details of the error that has occurred for possible use by technical support staff.
- Display a friendly, user-centric message to inform the user, in general terms, that something terrible has happened, making it clear that what has happened is not related to their use of the system. The user message should include some form of unique identifier to allow the user to easily report what has happened, via some form of helpdesk.
Ideally, the user reporting of the error should be automated in some way (for example using desktop email automation) in order to make the process of reporting as simple as possible and to avoid errors during the process. If the process is automated, this will avoid the problem of users ignoring the errors because reporting them is too much trouble and will ensure accurate reporting of each error.
From the information in the user's error report, a helpdesk can escalate the problem to an administrator who can access detailed error information elsewhere in the system, using the identifier as a key.
Use this mechanism to handle all technical errors encountered by the system's user interface.
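The two tasks described above might be sketched as follows. This is only one possible shape, with assumptions called out: the class name TechnicalErrorReporter is invented, a random UUID stands in for whatever scheme the Unique Error Identifier pattern prescribes, and the method returns the user message instead of being void (as in the signature shown earlier) so that the example does not depend on a GUI toolkit.

```java
import java.util.UUID;
import java.util.logging.Level;
import java.util.logging.Logger;

final class TechnicalErrorReporter {
    private static final Logger LOG =
        Logger.getLogger(TechnicalErrorReporter.class.getName());

    /**
     * Logs the full technical detail under a fresh identifier and returns
     * the friendly text that the user interface should display.
     */
    static String notifyTechnicalError(Throwable t) {
        // Assumption: a random UUID is unique enough to act as the error key.
        String errorId = UUID.randomUUID().toString();

        // Task 1: full detail, including the stack trace, for support staff.
        LOG.log(Level.SEVERE, "Technical error [" + errorId + "]", t);

        // Task 2: a general message that carries only the identifier.
        return "Sorry, something went wrong inside the system. This was not "
             + "caused by anything you did. Please quote reference "
             + errorId + " to the helpdesk.";
    }
}
```

The exception's own message never reaches the user; the identifier is the only link between what the user sees and what is in the log.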
Positive consequences
- Users of the system are never presented with technical error information that could confuse or worry them.
- The system becomes easier to support because support staff can correlate fatal system errors with logged information in order to allow them to understand and investigate the problem.
- Error handling in the GUI implementation is simplified and standardized.
Negative consequences
- Concealing all error information from the end-user means that a knowledgeable end-user is powerless to apply their own knowledge to solve the problem. This could mean that a number of avoidable calls are made to helpdesks, that could otherwise be resolved by the users themselves.
- The implementation of this pattern may require the implementation of a reasonably sophisticated error-handling framework and this may be perceived as a significant overhead within the development process.
Related patterns
- This pattern fits very naturally with the Big Outer Try Block to ensure that technical errors are displayed and logged appropriately.
- Using the Log at Distribution Boundary pattern to govern where technical errors are logged ensures that the errors received are suitable for reporting to the end user and include a suitable unique identifier.
- This pattern can alternatively be combined directly with Unique Error Identifier to ensure that errors can be clearly identified and to mitigate the potential confusion arising from one error causing multiple log entries.
- An Error Dialog [Renzel97] forms part of a strategy to hide errors from users.
Log Unexpected Errors
Problem
Much domain code includes handling of exceptional conditions and is designed to recognize and handle each condition according to a business process definition (typically the offending transaction being rejected or a new domain entity being created). If such routine error conditions are logged, this makes real errors requiring operator intervention difficult to spot.
Context
Where systems are created in organizations with complex domain processing, or systems with a large number of routinely expected error conditions to which the processes specify the response.
Forces
- The system should report errors when they occur so that they can be investigated and fixed but it is easy for serious errors to be hidden under large numbers of spurious or trivial problems. It should be obvious when operator intervention is required.
- If all possible error conditions, including those routinely encountered during normal operation, are reported then log management becomes much more difficult due to the speed at which the error logs fill up.
- Recording lots of errors increases the amount of logging code and the number of error messages that need to be managed, which reduces the maintainability of the system.
Solution
Implement separate error handling mechanisms for expected and unexpected errors. Error conditions that are expected to arise in the course of normal domain processing should not be logged but handled in the code or by the user. Hence, any logged error should be viewed as requiring investigation.
Implementation
Throughout the system's implementation, use two distinct error handling approaches for expected and unexpected errors:
- Log unexpected errors according to the other patterns, such as Log at Distribution Boundary , and put in place a process that ensures the error triggers operator intervention to resolve the situation.
- Do not log expected errors, but handle them as part of the system's normal operation. This may be done in the code itself, maybe by trying different domain logic that may be able to handle the given inputs or scenario or creating alternative domain entities.
Alternatively, the application may interact with the user, inform them of the problem (in appropriate terms - see Hide Technical Error Detail From Users) and prompt them to re-start part or all of the current operation.
By following these principles, errors such as 'could not connect to database' are not hidden by hundreds of routine error conditions such as 'no such product code' (perhaps caused by a user misreading a code from a piece of product packaging). As the former error is a significant error requiring investigation, while the second is an expected error condition, the former would be logged and the latter handled algorithmically by the business logic, without logging the condition.
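The contrast between the two example errors can be sketched in code. The following is a minimal illustration rather than a prescribed design: OrderLineHandler and the ProductStore interface are hypothetical names, an IOException stands in for whatever the data access layer raises when the database is unreachable, and logging uses java.util.logging.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

final class OrderLineHandler {
    private static final Logger LOG =
        Logger.getLogger(OrderLineHandler.class.getName());

    /** Hypothetical data access interface; returns null for an unknown code. */
    interface ProductStore {
        String findName(String productCode) throws IOException;
    }

    /** Returns the text to show the user for one entered product code. */
    static String handle(ProductStore store, String productCode) {
        try {
            String name = store.findName(productCode);
            if (name == null) {
                // Expected condition: handled by the business logic, not logged.
                return "No such product code: " + productCode;
            }
            return "Added " + name;
        } catch (IOException e) {
            // Unexpected condition: logged so that it triggers investigation.
            LOG.log(Level.SEVERE, "Could not connect to database", e);
            return "The system is temporarily unavailable.";
        }
    }
}
```

With this split, every entry in the log corresponds to something an operator should look at.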
One variation on this approach is to log different types of error message to different places. For example, in terms of the application itself a user failing to authenticate may not be worth recording. However, from the system's point of view (i.e. the operating system) the security policy may require all failed authentications to be logged. This is usually resolved by logging different types of errors to different logs, such as the application event log and security event log provided under Windows. Such partitioning allows different logs to be created to serve the needs of different areas of concern. Another example of this is where knowledge of the patterns in which errors occur would be of interest to developers - large numbers of failed searches at a search engine site may indicate a usability problem. However, such errors are not of interest to the operations team who are responsible for keeping the system running. In this case, the expected errors could be logged to a different location where they will not interfere with the operational errors but can be retrieved later by the development team for further analysis.
A second variation is to log different types of error message in one location but to mark each log message with one or more attributes that allow a set of filters to be created to provide the ability to extract various subsets of the log content on demand to support different uses (such as error monitoring versus usability analysis).
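One way to picture this second variation is sketched below, using the standard java.util.logging types. The category names and the CategorisedLog class are invented for the example, and carrying the attribute in the record's parameter array is simply a convenient assumption; a real logging framework may offer a more direct way to attach attributes.

```java
import java.util.logging.Filter;
import java.util.logging.Level;
import java.util.logging.LogRecord;

final class CategorisedLog {
    /** Hypothetical attribute values marking the intended audience. */
    enum Category { OPERATIONAL, USABILITY, SECURITY }

    /** Builds a log record tagged with a category attribute. */
    static LogRecord record(Category category, Level level, String message) {
        LogRecord r = new LogRecord(level, message);
        // Assumption: the first parameter slot carries the category.
        r.setParameters(new Object[] { category });
        return r;
    }

    /** A filter that accepts only records tagged with the wanted category. */
    static Filter only(Category wanted) {
        return r -> {
            Object[] params = r.getParameters();
            return params != null && params.length > 0 && params[0] == wanted;
        };
    }
}
```

A handler used for error monitoring could be given only(Category.OPERATIONAL) as its filter, while a separate report for the development team selects the USABILITY records from the same log.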
Positive consequences
- Errors that appear in logs always indicate exception conditions and so can be used to initiate support and diagnostic activities.
- Spurious messages indicating that expected conditions have occurred do not prevent easy recognition of the occurrence of exceptional conditions.
- Logs do not quickly fill up with spurious messages created as a result of normal operation.
- Application code is simplified as a result of the reduction in the number of log messages that need to be produced.
Negative consequences
- If the recognition of expected errors is not specific or accurate enough then there is the danger of masking or ignoring exceptional conditions, by incorrectly assuming them to be manifestations of expected error conditions.
Related patterns
- You need to ensure that the correct distinction is made between expected errors and exceptional occurrences as described in Make Exceptions Exceptional .
- It may be helpful to classify errors as 'domain' or 'technical' errors, as described in Split Domain and Technical Errors .
- An approach such as 3 Category Logging [Dyson04] can help to make a log filterable.
Make Exceptions Exceptional
Problem
A number of languages include exception handling facilities and these are powerful additions to the error handling toolkit available to programmers. However, if exceptions are used to indicate expected error conditions occurring, then the calling code becomes much more difficult to understand.
Context
Any situation where a language with exception handling built into it is in use.
Forces
- An application should be designed to handle and recover from most domain errors but some unexpected errors will always occur. Examples of the latter include incorrect or missing application data in the database and incorrect or missing values in configuration files.
- The code paths for handling 'recoverable' errors and 'unrecoverable' errors are usually quite distinct so they should be easily differentiated.
- Large numbers of exceptions generated cause problems for the consumer of a class/method - especially in a language that uses checked exceptions.
- We want to avoid convoluted code and algorithm distortion when routine error conditions (such as 'end of list') are encountered.
Solution
Indicate expected domain errors by means of return codes. Only use exceptions to indicate runtime problems such as underlying platform errors or configuration/data errors.
Implementation
When designing the interfaces in your system you should classify errors into two types:
- Conditions that will occur routinely in standard algorithms, which should be indicated by returning a reserved return value. This could be a null pointer, an empty list or a specific return code ( E_INVALID ). An example of this would be returning an empty list from a search operation that did not match any items.
- Conditions that will only occur due to unexpected errors, which should be indicated by raising language exceptions. Examples of such conditions include those caused by underlying platform or network failure (cannot connect to database), incorrect configuration (bad database connection string) or bad application data (customer id - not name - could not be found).
Errors of the first type will be handled as part of the standard business logic in the system. On the other hand, errors of the second type will normally be handled by a combination of logging and exiting the current code block via an exception path.
It is worth briefly exploring the differences and the blurring of the boundaries here through an example. Consider a component in a retail system that offers two methods to look up product information. One of these methods allows you to look up products either by keyword or wildcard text string and returns a list of matching products. The other method requires a numeric product code such as a barcode and returns the single product matching that code. The component is backed by a database containing all the products stocked by the retailer.
The search by keyword/wildcard has no guarantee of finding a matching product. Typically, the keyword/wildcard will be entered by a user and so could be subject to all forms of data problems such as mis-spelling or unrealistic expectations (e.g. entering "Elton John" when the retailer just sells food - not CDs). Hence, semantically you could expect no products to be returned - this is an expected business condition, however unhelpful it is to the user of the calling application. Having said that, the user can always get the answer they want by trying again - providing sensible input to the search.
On the other hand, there is more of a semantic implication that the method that requires a product code should find something. Unless users of the system are prone to scanning in barcodes from random products they bring into work, any product scanned in store should be in the database: you should not be able to provide a code that cannot be found. In this case, you could justifiably throw an exception as the only way this condition can occur is if there is a problem with the data in your database. Not only can the user not get the right answer by re-scanning the product (same answer each time...), but in terms of the system this situation needs resolving (i.e. the data in the database needs correcting).
Finally, in either case if the component cannot connect to the database for whatever reason a technical exception should be raised (indeed, the underlying platform will probably raise one for you).
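The retail example might look something like the following in code. It is a sketch under stated assumptions: an in-memory map stands in for the product database, products are reduced to their names, and ProductCatalogue and MissingProductException are hypothetical names. The technical exception raised when the database cannot be reached is not modelled here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class ProductCatalogue {
    /** Hypothetical exception signalling inconsistent application data. */
    static final class MissingProductException extends RuntimeException {
        MissingProductException(String productCode) {
            super("No product stored for code " + productCode);
        }
    }

    // Assumption: a map of product code to product name replaces the database.
    private final Map<String, String> namesByCode;

    ProductCatalogue(Map<String, String> namesByCode) {
        this.namesByCode = namesByCode;
    }

    /** Keyword search: no match is an expected outcome, so return an empty list. */
    List<String> searchByKeyword(String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        for (String name : namesByCode.values()) {
            if (name.toLowerCase(Locale.ROOT).contains(needle)) {
                matches.add(name);
            }
        }
        return matches;
    }

    /** Lookup by scanned code: a miss means bad data, so raise an exception. */
    String findByCode(String productCode) {
        String name = namesByCode.get(productCode);
        if (name == null) {
            throw new MissingProductException(productCode);
        }
        return name;
    }
}
```

Calling code can loop over the search result without any exception handling, while a failed lookup by code leaves the normal path and can be logged as an unexpected error.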
Positive consequences
- Application code is simplified as it does not need to include exception handling constructs within normal algorithms.
- Exceptional conditions in code can all be treated as abnormal situations requiring error handling and as such can be handled via a uniform strategy.
Negative consequences
None
Related patterns
- Expected errors should not be logged, as described in Log Unexpected Errors, but unexpected errors - whether technical or domain exceptions - should be logged.
- Ward Cunningham's CHECKS pattern language for information integrity [Cunningham] provides a great deal of guidance relating to the design of data validation in user interfaces (i.e. 'Type 1' errors in this pattern's Implementation section).
Proto-Patterns
Ignore Irrelevant Errors
- Problem: Sometimes technical errors or exceptions do not denote a real problem and so reporting them can just be confusing or irritating for support staff.
- Solution: Assess what action can be taken in response to an error and only log it if there is a relevant course of action. An example is ThreadAbortException which is raised under ASP.NET whenever you transfer to another page using Server.Transfer(). This is not an error condition - just a side-effect - and so is of no consequence to support staff. Also, you will get lots of these in any busy web-based system.
Single Type for Technical Errors
- Problem: There are a myriad different technical errors that may occur during a call to an underlying component.
- Solution: When you create your exception/error hierarchy for your application, define a single error type to indicate a technical error, e.g. SystemError. The definition and use of a single technical error type simplifies interfaces and prevents calling code needing to understand all of the things that can possibly go wrong in the underlying infrastructure. This is especially useful in environments that use checked exceptions (e.g. Java).
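A minimal Java sketch of this proto-pattern follows. SystemError is the name suggested in the text; making it a checked exception, the Infrastructure helper class and its call method are assumptions made for the example. The helper is intended to wrap infrastructure calls only, since it translates every exception it sees.

```java
import java.util.concurrent.Callable;

/** The single technical error type (checked here, which is an assumption). */
final class SystemError extends Exception {
    SystemError(String message, Throwable cause) {
        super(message, cause);
    }
}

final class Infrastructure {
    /**
     * Runs a low-level operation and translates any failure into SystemError,
     * keeping the original exception as the cause for later diagnosis.
     */
    static <T> T call(String description, Callable<T> operation) throws SystemError {
        try {
            return operation.call();
        } catch (SystemError e) {
            throw e; // already translated by a nested call
        } catch (Exception e) {
            throw new SystemError("Technical failure while " + description, e);
        }
    }
}
```

Interfaces built on such a helper need to declare only throws SystemError, rather than listing every exception type that the database, network or naming services might raise.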
References
[Cunningham] CHECKS: A Pattern Language of Information Integrity, http://c2.com/ppr/checks.html
[Dyson04] Dyson 2004 Architecting Enterprise Solutions: Patterns for High-Capability Internet-based Systems, Paul Dyson and Andy Longshaw, John Wiley and Sons, 2004
[Haase] Java Idioms - Exception Handling, linked from http://hillside.net/patterns/EuroPLoP2002/papers.html
[Renzel97] 'Error Handling for Business Information Systems', Eoin Woods, linked from http://hillside.net/patterns/onlinepatterncatalog.htm
Acknowledgements
We'd like to thank our EuroPLoP 2005 shepherd, Ofra Homsky, for her thorough and valuable feedback during this paper's review process; our original EuroPLoP 2004 shepherd, Bob Hanmer, for providing very valuable advice on the original paper; the members of the EuroPLoP 2004 and 2005 workshops; and the members of the OT2004 workshop at which this paper was first presented.