To Thread or Not To Thread?
- June 9, 2006
- Jesse Glick
NetBeans currently uses a very baroque threading model and there are a lot of problems with it: API usability is impaired by the added complexity of locking and multithreaded accesses; the code is difficult to understand; there are occasional race conditions and deadlocks which are hard to solve; performance may be affected by excessive locking and context switches. This document summarizes some directions we might take to solve these problems.
"Threading model" refers to the general system by which an application does things in parallel (or does not do things in parallel): managing threads (a core Java language feature), locks and synchronization, asynchronous tasks, scheduled tasks and timers, the GUI loop, etc.
It is difficult to describe the current state of threading in NetBeans precisely; no one is completely sure how it works. Various aspects of threading were introduced into the code base incrementally, to address particular issues, and there has never been a comprehensive review of the whole architecture.
In a nutshell, besides the GUI loop mandated by AWT & Swing (henceforth: "event thread"), NetBeans also runs a thread for Datasystems and Lookup combined, plus any number of asynchronous threads for various refresh tasks and invoked actions, plus isolated threads for long-running things such as compilation, CVS operations, and so on. With few exceptions, objects are locked using fine granularity.
Some NetBeans threading behaviors applicable to different subsystems, especially those which can be considered "infrastructure", are now given. There is also an API document which attempts to define some of the API-visible behaviors, though this document is not entirely up to date and in some cases is just wrong.
- Module System
Controlled by a read-write lock (Mutex). GUI display of module information uses a special bean which decouples the module state from the event thread.
- Window System
External interface should be EQ only. Internals mostly EQ, though presumably something else for persistence.
- Dialogs and Wizards
Actual GUI is event thread only, though the displayer logic can run in any thread, for historical reasons; error notification can be called from any thread for ease of use.
- Filesystems
Supposedly thread-safe, using complex fine-grained locking. Event delivery is usually synchronous, though isolated modifications are supported (erroneously referred to as "atomic"). A few aspects (e.g. auto refresh or JAR modification detection) are asynchronous.
- Datasystems
Supposedly thread-safe, using very complex fine-grained locking. Folder recognition (as well as Lookup, q.v.) occurs in a dedicated thread, though data object recognition can occur in various threads. Event delivery and processing are often asynchronous. Relies on Filesystems thread semantics as an underlying layer, but is synchronized with it.
- Lookup
The API proper is essentially single-threaded (like e.g. the Java Collections API), though there is some undocumented synchronization. The implementation using the Services/ folder and Datasystems API, which may be dropped if and when Registry is available, piggybacks on the Datasystems folder recognition thread (q.v.).
- Nodes
Thread-safe with a fair amount of complex fine-grained locking; children hierarchy uses asynchronous processing and a dedicated read-write lock, as well as timeout to collect unexpanded children.
- Looks (proposed)
Appears similar to Nodes—some fine-grained locking, threading model unclear. Callbacks usually occur in the event thread, but sometimes not.
- Explorer
Event thread only; uses a special bridge to access the Nodes children hierarchy, so that Children methods are sometimes called from the event thread and sometimes not.
- Actions
Mainly handled in the event thread; actions can choose to fork a task as needed. Action state updates usually happen in the event thread (not clear if this is guaranteed however). Action presentation (name, icon, etc.) is handled in the event thread.
- Settings
Usually intended to be thread-safe, using fine-grained locking. Difficult to understand since Datasystems and Lookup are currently used as underlying layers. Delayed asynchronous processing is common. Awaiting replacement e.g. using Registry and/or Preferences.
- Editor
General API operations are mostly thread-safe. Document operations generally follow the Swing provisions: thread safe, with a read lock, and a NetBeans-added write lock (which may be only an isolated, not atomic, section; this is not clear). Other GUI operations require the event thread. Document opening is complex and always runs asynchronously, even when called synchronously. Actual editor implementation: unknown model, beyond what is required by the API. Code completion queries and updates seem to run asynchronously. Settings follow a very complex threading model.
- Output Window
Apparently thread-safe; internal display logic all in EQ.
- Ant (incl. compilation, execution, etc.)
Thread-safe for the outside caller; almost completely insulated.
- Project infrastructure
Uses a single mutex.
- (Java) source parsing and recreation
Needs investigation. Has some sort of read/write lock and complex rules for avoiding the event queue.
Structural models of other file types (properties files, XML, etc.) probably use fine-grained locking, performing parses off the event thread. Unknown what model is used for event firing, etc.
- General platform infrastructure ("core")
Uses a variety of different models according to what layers it is interacting with. Sometimes uses fine-grained locking or asynchronous processing. Warm-up tasks run during application startup are run asynchronously.
So how much is threading used in NetBeans, anyway?
Running a simple command using a Perl script over all Java sources in the NetBeans trunk as of Jul 07 2003, you can see a list of all packages including how many lines of code the package contains (skipping comments, blank lines, package declarations, and imports), and how many of those lines explicitly use threading-related idioms (including common classes and methods in the Java platform as well as NetBeans threading utilities). The summary also shows packages which use a lot of threading idioms, in both absolute and proportional terms. Casual inspection of matching lines confirms that most of the matches are really related to thread semantics.
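The kind of count described above can be sketched in a few lines of Java. This is not the original Perl script: the idiom list, the class name ThreadingIdiomCounter, and the simplified comment handling are all assumptions made for illustration.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Counts code lines and threading-related lines in Java source text. */
final class ThreadingIdiomCounter {
    // Assumed idiom list for illustration; the real script's patterns are not given here.
    private static final Pattern IDIOM = Pattern.compile(
        "\\b(synchronized|Thread|Runnable|wait|notify|notifyAll|invokeLater"
        + "|invokeAndWait|RequestProcessor|Mutex|waitFinished)\\b");

    /** Returns {codeLines, threadingLines} for the given source lines. */
    static int[] count(List<String> lines) {
        int code = 0;
        int threading = 0;
        boolean inBlockComment = false;
        for (String raw : lines) {
            String line = raw.trim();
            if (inBlockComment) {
                // Simplification: anything after the closing marker is ignored.
                if (line.contains("*/")) inBlockComment = false;
                continue;
            }
            if (line.startsWith("/*")) {
                if (!line.contains("*/")) inBlockComment = true;
                continue;
            }
            // Skip blank lines, line comments, package declarations, and imports.
            if (line.isEmpty() || line.startsWith("//")
                    || line.startsWith("package ") || line.startsWith("import ")) {
                continue;
            }
            code++;
            if (IDIOM.matcher(line).find()) threading++;
        }
        return new int[] {code, threading};
    }

    /** Percentage of code lines that mention a threading idiom. */
    static double percent(int[] counts) {
        return counts[0] == 0 ? 0.0 : 100.0 * counts[1] / counts[0];
    }
}
```

A regex match of this sort is only a rough signal, which is why the matching lines still need casual inspection.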
The actual list of threading-related lines for org.openide.text.CloneableEditorSupport is given as an example. This class uses threading heavily: synchronization on several different monitors; asynchronous tasks posted to RequestProcessor, including waiting for them to finish, getting notification of them finishing, and exposing them to outside code; and tasks posted to the event thread. (According to its CVS history, this class has been patched at least 14 times to solve threading-related problems.)
Subsystems that appear to use threading heavily, based on the above summary, include many parts of the NetBeans core implementation; the Datasystems API; the VCS Core module; the Debugger module; parts of the Java and XML modules; etc. Generally, packages using threading-related idioms in over 1% of code lines can be considered to be serious users of such semantics; this may not seem like a lot, but remember that such method calls require a lot more mental analysis than regular sequential code and can have large effects on the runtime behavior of other pieces of code.
(The Debugger module seems to show up so heavily at least in part because it appears to have copied and pasted RequestProcessor, a complex threading utility class from the Open APIs, and added private patches.)
NetBeans' bug tracking system records over 600 threading-related bugs, such as deadlocks and race conditions, some of which have not yet been fixed, or were not found to be reproducible.
Overuse of threads, even in a language like Java that takes care of many details, is dangerous and should be avoided whenever reasonable:
Threading semantics in Java is specific to the operating system, which can increase the testing burden and lead to platform-specific bugs.
Fine-grained object locking code written by any but the most experienced programmers tends to be wrong in one way or another. It is a difficult thing to get right, and hard to tell by looking at the source code if it is right or not—especially when the threading semantics spans class or package boundaries. When multithreading is truly necessary, it is generally less trouble-prone to use coarse-grained locking (a whole subsystem is grabbed, operated on, then released) or to use message-passing style.
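As a rough illustration of the coarse-grained style, the sketch below guards a whole (hypothetical) subsystem with one read-write lock and leaves its internal state unsynchronized. ProjectModel and its methods are invented names, and java.util.concurrent's ReentrantReadWriteLock stands in for whatever lock utility a real subsystem would use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

/** A subsystem guarded by one coarse read-write lock (hypothetical example). */
final class ProjectModel {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<String> files = new ArrayList<>(); // plain, unsynchronized state

    /** Runs a read-only action with the whole subsystem read-locked. */
    <T> T read(Supplier<T> action) {
        lock.readLock().lock();
        try {
            return action.get();
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Runs a mutating action with the whole subsystem write-locked. */
    void write(Runnable action) {
        lock.writeLock().lock();
        try {
            action.run();
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Accessors assume the caller is inside read(...) or write(...).
    void addFile(String name) { files.add(name); }
    int fileCount() { return files.size(); }

    /** Demo: several threads add files concurrently; returns the final count. */
    static int concurrentAddDemo(int threads, int perThread) {
        ProjectModel model = new ProjectModel();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    final String name = id + "/" + i;
                    model.write(() -> model.addFile(name));
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        Integer count = model.read(model::fileCount);
        return count;
    }
}
```

The point of the design is that there is exactly one lock to reason about: code inside read or write can be reviewed as if it were sequential.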
Code which does not run synchronously is far harder to write unit tests for, so such tests are often skipped. Bugs in such code are often not reproducible by the developer easily or at all, yet occur in the field. Threading-related bugs range from recoverable minor errors, to possibly corrupt internal data, to complete deadlocks ("grey screen of death") which can cause data loss.
While splitting a job into several parallel threads can make it run faster on machines with several processors (still somewhat uncommon for the client computers that NetBeans runs on), this is only true up to a point, and all multithreading (but especially fine-grained locking) introduces some inherent overhead which can slow down a program. Generally primitive locking (monitors) is fast in modern VMs, but thread context switches can be much slower. NetBeans may suffer from such inefficiencies, though it is difficult to measure this since profilers cannot readily pinpoint it.
Much of the threading behavior in NetBeans was never documented, much less formally modelled or specified. Plug-in modules attempting to use NetBeans APIs must deal with threading at various points, yet the restrictions are poorly documented and understood. Most module developers (in Sun and elsewhere) are anyway not sufficiently experienced with Java threads to use them safely in production code. Even NetBeans core platform developers tend to rely on a mixture of knowledge and experimentation when making substantial code changes affecting thread usage.
Swing is single-threaded and cannot be used from other threads than the AWT event dispatch thread (henceforth just "AWT thread"). NetBeans code running in arbitrary threads must therefore be "replanned" to the AWT thread sometimes. Unfortunately, Swing does not enforce this directly; if you forget, usually things will seem to work, but occasionally a mysterious error will appear, and sometimes there are various disturbances in the keyboard focus (harming accessibility).
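A minimal sketch of such "replanning", together with a fail-fast check that makes a forgotten replan visible immediately. EventThreadUtil and its method names are hypothetical; only java.awt.EventQueue.isDispatchThread() and invokeLater() are real platform calls.

```java
import java.awt.EventQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Helpers for getting code onto the AWT thread (hypothetical utility). */
final class EventThreadUtil {
    /** Runs the task in the AWT thread: directly if already there, else queued. */
    static void runInEventThread(Runnable task) {
        if (EventQueue.isDispatchThread()) {
            task.run();
        } else {
            EventQueue.invokeLater(task);
        }
    }

    /** Fails fast when GUI code is reached from the wrong thread. */
    static void assertEventThread() {
        if (!EventQueue.isDispatchThread()) {
            throw new IllegalStateException(
                "not in AWT thread: " + Thread.currentThread().getName());
        }
    }

    /** Demo: replans a task from the calling thread and reports where it ran. */
    static boolean runsOnEventThread() {
        AtomicBoolean onAwt = new AtomicBoolean(false);
        CountDownLatch done = new CountDownLatch(1);
        runInEventThread(() -> {
            onAwt.set(EventQueue.isDispatchThread());
            done.countDown();
        });
        try {
            done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return onAwt.get();
    }
}
```

Calling assertEventThread() at the top of GUI-only methods turns an occasional mysterious error into an immediate, reproducible one.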
It seems that a significant proportion of NetBeans' instability issues and code complexity can be traced to the lack of a simple threading model.
Of course, threads are not present in the language and used in NetBeans for no reason. There are circumstances where they are useful or necessary.
- Network latency
Any IDE operations which use the network—for example, parsing an XML file that might refer to external entities or schemas not available in a local catalog or entity resolver—can be arbitrarily slow, depending on network congestion and availability, and must not block the event thread or other critical work threads. Ideally such operations should also be cancellable from the GUI.
(At least in some cases, e.g. XML parsing, it may be possible to first attempt the parse from the event thread, using an entity resolver that fails when passed a network URL. If the parse succeeds, fine; if not, asynchronously retrieve the network resource(s), cache them temporarily, and retry the parse.)
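The first half of that idea, a parse attempt that refuses network entities, might look roughly like this with the standard JAXP API. LocalFirstXmlParser, the blockedUrls list, and the set of schemes treated as "network" are assumptions for illustration; the asynchronous fetch, cache, and retry are left out.

```java
import java.io.IOException;
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Tries to parse XML without touching the network (hypothetical helper). */
final class LocalFirstXmlParser {
    /** System IDs refused during parsing; candidates for a later background fetch. */
    final List<String> blockedUrls = new ArrayList<>();

    /** True for URL schemes assumed to imply network access (not exhaustive). */
    static boolean isNetworkUrl(String systemId) {
        if (systemId == null) return false;
        try {
            String scheme = URI.create(systemId).getScheme();
            return scheme != null
                && (scheme.equalsIgnoreCase("http")
                    || scheme.equalsIgnoreCase("https")
                    || scheme.equalsIgnoreCase("ftp"));
        } catch (IllegalArgumentException e) {
            return false; // not a well-formed URI; let the parser deal with it
        }
    }

    /** Returns the document, or empty if the parse failed or needed the network. */
    Optional<Document> tryParse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setEntityResolver((publicId, systemId) -> {
                if (isNetworkUrl(systemId)) {
                    blockedUrls.add(systemId);
                    throw new IOException("network entity refused: " + systemId);
                }
                return null; // default resolution for local resources
            });
            return Optional.of(builder.parse(new InputSource(new StringReader(xml))));
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return Optional.empty();
        }
    }
}
```

When tryParse returns empty and blockedUrls is non-empty, the caller knows which resources to retrieve in the background before retrying.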
Access to plain files (using java.io.File) can however be assumed to be fast. Although some operating systems do permit files to be mounted from local drives which might suffer network latency (typically small), such delays are likely to freeze most applications on the user's computer, even many command shells; NetBeans is not a special case. Users wishing to avoid such delays, if they are really a problem, should use local disks, or network disks that support timeouts, local caches, etc.
- Long-running tasks
Some tasks started by the IDE are naturally lengthy in their operation. For example, running a compilation (including using an Ant script) could take many seconds, minutes, or even hours to complete. It is unacceptable for the application to be unresponsive during this period. Such tasks should therefore be run off the event thread; typically they are given their own dedicated thread or thread group. Long-running tasks should always be cancellable.
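A minimal sketch of a cancellable task on a dedicated worker thread, using plain java.util.concurrent rather than the NetBeans engines. CancellableJob and longComputation are hypothetical names, and the sleep merely stands in for real work.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** A long-running job that can be cancelled from another thread. */
final class CancellableJob {
    /** Simulated lengthy work; each step is interruptible. */
    static int longComputation(int steps) throws InterruptedException {
        int done = 0;
        for (int i = 0; i < steps; i++) {
            Thread.sleep(10); // stands in for one unit of work; throws if interrupted
            done++;
        }
        return done;
    }

    /** Starts the job off the calling thread, cancels it, and reports whether the worker stopped. */
    static boolean startThenCancel() {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        Future<Integer> job = worker.submit(() -> longComputation(1_000_000));
        boolean cancelled = job.cancel(true); // true = interrupt the worker if it is running
        worker.shutdown();
        try {
            return cancelled
                && job.isCancelled()
                && worker.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Cancellation here is cooperative: it works only because the job reaches an interruptible call regularly, so real tasks need similar checkpoints.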
NetBeans already has mechanisms to handle this kind of task, for example the compilation and execution engines, which provide independent thread groups; process listing and status; and cancellability. Tasks may or may not need to interact with other NetBeans internals running in more normal threading modes (for example, in the event thread). Ant scripts for example have only minimal interaction with the rest of NetBeans—sending log messages back to the output window to be displayed. Such interactions must manually synchronize with the rest of NetBeans as appropriate.
Control of external (forked) processes typically falls into this category as well. For example, the JPDA debugger needs to send messages to the debugged process and wait for replies. Since the debugged process might be on a remote machine, or might be deadlocked or slow, the IDE should not assume that replies will arrive promptly or at all; some explicit threading is required here.
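One way to avoid assuming that a reply will arrive is to bound every wait. The sketch below uses a standard Future with a timeout; ReplyWaiter and awaitReply are hypothetical names and do not correspond to any debugger API.

```java
import java.util.Optional;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Waits for a reply from a possibly slow or dead peer (hypothetical helper). */
final class ReplyWaiter {
    /** Returns the reply, or empty if it failed or did not arrive in time. */
    static Optional<String> awaitReply(Future<String> reply, long timeoutMillis) {
        try {
            return Optional.ofNullable(reply.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            reply.cancel(true); // give up on the request instead of hanging
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        } catch (ExecutionException e) {
            return Optional.empty(); // the request itself failed
        }
    }
}
```

The caller must then decide what an empty result means to the user, for example showing the debugged process as unresponsive.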
- GUI responsiveness
A more controversial use of threading is to improve GUI responsiveness in the application, i.e. to avoid small delays (usually measured in hundreds of milliseconds) between a user action and its visible effect.
A special case of this kind of optimization is "warm-up" support, which asynchronously initializes various subsystems just after the application has been displayed and is more or less ready to use. Initialization comprises both class loading as well as other precalculations of data structures and loading of configurations and caches.
While the goal of GUI responsiveness is certainly important, the trade-off of using asynchronous threads to accomplish it merits analysis. If there is a way to achieve and keep GUI responsiveness associated with a piece of functionality without using special threads or nondeterminacy, then that is to be preferred.
Here are some recommendations for what can be done for new code, and what can be changed for existing code in the short term.
Rather than attempting to make every possible fix in one big branch, it is planned to separate the work into logical chunks, developing each in turn and merging to the trunk in a timely manner, so as to get practical feedback on whether the general direction is working well. In addition, this increases the likely number of fixes to make it into earlier promotions or NetBeans releases.
Tentative schedule of work:
- Looks & Nodes in EQ
- Line and other Editor API cleanup
- Locking models & utilities for Registry, Filesystems—draft
- Locking model & utilities for Registry (& global lookup?)—impl
- Locking model & utilities for Filesystems (& FS-Ext)—impl
- Locking model & utilities for Java parsing, …
- TBD; probably will be obsoleted by rewrites of Java infrastructure?
See a separate proposal.
Code should avoid calling Task.waitFinished() without careful consideration of the consequences, chiefly possible deadlock. Specifically, you should never call this method if there is any chance that you, the caller, might be holding some kind of lock which the task might also try to acquire; deadlock will ensue if you do.
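The hazard can be reproduced safely with java.util.concurrent stand-ins: Future.get() plays the role of Task.waitFinished(), and the task uses a timed tryLock so that the bad case reports failure instead of actually hanging. WaitWhileLocked and taskGotLock are hypothetical names.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Shows why waiting for a task while holding a lock it needs is unsafe. */
final class WaitWhileLocked {
    /**
     * Submits a task that needs the lock, then waits for it.
     * Returns true if the task managed to acquire the lock.
     */
    static boolean taskGotLock(boolean waitWhileHoldingLock) {
        ReentrantLock lock = new ReentrantLock();
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            Future<Boolean> task;
            lock.lock();
            try {
                task = worker.submit(() -> {
                    // Timed tryLock stands in for lock(); a plain lock() would
                    // never return in the bad case.
                    boolean got = lock.tryLock(200, TimeUnit.MILLISECONDS);
                    if (got) lock.unlock();
                    return got;
                });
                if (waitWhileHoldingLock) {
                    return task.get(); // like waitFinished() under a lock
                }
            } finally {
                lock.unlock();
            }
            return task.get(); // lock already released: waiting is safe
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            worker.shutdownNow();
        }
    }
}
```

With waitWhileHoldingLock set, the task cannot get the lock for as long as the caller waits; with a real blocking lock that is a deadlock.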
This deadlock has been observed in the thread demo with CloneableEditorSupport.openDocument(), which blocks on a task spawned in a separate (RP) thread: if the calling thread is holding a write lock on a Phadhail, and the editor support impl tries to acquire a read lock in order to obtain the input stream, a deadlock results. The current workaround in PhadhailEditorSupport is to preload the phadhail contents synchronously under a read lock, then serve up these contents from the other thread when requested without trying to acquire a lock.
Better would be for CES.oD() to not spawn a separate thread to load the document, since it will need to block anyway. Additionally, it is necessary to handle the case that a nonblocking call to CES.getDocument() was made independently and that thread is still running. CES.oD() must not block on completion of that task unless that task is entirely wrapped in a read lock so that it could not happen that the task begins, then a writer blocks on the document, then the task requests the read lock.
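The preloading workaround described above can be sketched as follows. This is not the PhadhailEditorSupport code: PreloadedContent is a hypothetical class, and ReentrantReadWriteLock stands in for the demo's lock, on the assumption that a write-lock holder may also take the read lock (which ReentrantReadWriteLock permits).

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Snapshot contents under a read lock, then load from the snapshot elsewhere. */
final class PreloadedContent {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private String contents = "initial"; // guarded by lock

    /** Opens a "document" while the calling thread holds the write lock. */
    String openUnderWriteLock() {
        ExecutorService loader = Executors.newSingleThreadExecutor();
        lock.writeLock().lock();
        try {
            final String snapshot;
            lock.readLock().lock(); // caller thread only; the loader never locks
            try {
                snapshot = contents;
            } finally {
                lock.readLock().unlock();
            }
            // The loader works purely from the snapshot.
            Future<String> loaded = loader.submit(() -> "document:" + snapshot);
            return loaded.get(); // safe to block: the task needs no lock
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            lock.writeLock().unlock();
            loader.shutdownNow();
        }
    }
}
```

The cost is that the contents are read eagerly even if the other thread never asks for them, which is part of why not spawning the thread at all would be the cleaner fix.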
Architecture questions for modules should be tightened to require declarations of their threading models: if they use any models (internally or exposed in an API) which are not already declared by some API, this must be described. That way it will be possible to get an overview of threading usage throughout the system by reading the proper architecture documents.
API questions version 1.23 includes this question: online
- "Thread Demo" experiment
- Tracking issue
- Known threading bugs
- Don't make Swing components outside the event thread if holding locks
- #4499547: request assertions re. EQ usage in AWT/Swing components
- Robust Composition: Towards a Unified Approach to Access Control and Concurrency Control (see Chapter 13 for discussion of listener threading bugs)
- The Problem with Threads
Threading kits and techniques:
- "Responsive Applications Use Threads"
- Rethinking Swing Threading
For how to make interactive applications without threading at all (FRP approaches):