To Thread or Not To Thread?
- Version:
- June 9, 2006
- Author:
- Jesse Glick
- Abstract:
-
NetBeans currently uses a very baroque threading model and there are a lot
of problems with it: API usability is impaired by the added complexity of
locking and multithreaded accesses; the code is difficult to understand;
there are occasional race conditions and deadlocks which are hard to solve;
performance may be affected by excessive locking and context switches. This
document summarizes some directions we might take to solve these problems.
"Threading model" refers to the general system by which an application does
things in parallel (or does not do things in parallel): managing threads (a
core Java language feature), locks and synchronization, asynchronous tasks,
scheduled tasks and timers, the GUI loop, etc.
It is difficult to describe the current state of threading in NetBeans
precisely; no one is completely sure how it works. Various aspects of
threading were introduced into the code base incrementally, to address
particular issues, and there has never been a comprehensive review of the
whole architecture.
In a nutshell, besides the GUI loop mandated by AWT & Swing (henceforth:
"event thread"), NetBeans also runs a thread for Datasystems and Lookup
combined, plus any number of asynchronous threads for various refresh tasks
and invoked actions, plus isolated threads for long-running things such as
compilation, CVS operations, and so on. With few exceptions, objects are
locked using fine granularity.
The threading behaviors of various NetBeans subsystems, especially those
which can be considered "infrastructure", are listed below. There is also an
API document which attempts to define some of the API-visible behaviors,
though it is not entirely up to date and in some cases is just wrong.
- Module System
-
Controlled by a read-write lock (Mutex). GUI display of module
information uses a special bean which decouples the module state from the
event thread.
- Window System
-
External interface should be EQ only. Internals mostly EQ, though presumably something else for persistence.
- Dialogs and Wizards
-
Actual GUI is event thread only, though the displayer logic can run in any
thread, for historical reasons; error notification can be called from any
thread for ease of use.
- Filesystems
-
Supposedly thread-safe, using complex fine-grained locking. Event delivery
is usually synchronous, though isolated modifications are supported
(erroneously referred to as "atomic"). A few aspects (e.g. auto refresh or
JAR modification detection) are asynchronous.
- Datasystems
-
Supposedly thread-safe, using very complex fine-grained locking.
Folder recognition (as well as Lookup, q.v.) occurs in a
dedicated thread, though data object recognition can occur in various
threads. Event delivery and processing often asynchronous. Relies on
Filesystems thread semantics as an underlying layer, but is synchronized
with it.
- Lookup
-
The API proper is essentially single-threaded (like e.g. the Java
Collections API), though there is some undocumented synchronization. The
implementation using the Services/ folder and Datasystems API,
which may be dropped if and when Registry is available, piggybacks on the Datasystems folder
recognition thread (q.v.).
- Nodes
-
Thread-safe with a fair amount of complex fine-grained locking; children
hierarchy uses asynchronous processing and a dedicated read-write lock, as
well as timeout to collect unexpanded children.
- Looks (proposed)
-
Appears similar to Nodes—some fine-grained locking, threading model
unclear. Callbacks usually occur in the event thread, but sometimes not.
- Explorer
-
Event thread only; uses a special bridge to access the Nodes children
hierarchy, so that Children methods are sometimes called from
the event thread and sometimes not.
- Actions
-
Mainly handled in the event thread; actions can choose to fork a task as
needed. Action state updates usually happen in the event thread (though it is
not clear whether this is guaranteed). Action presentation (name, icon, etc.)
is handled in the event thread.
- Settings
-
Usually intended to be thread-safe, using fine-grained locking. Difficult to
understand since Datasystems and Lookup are currently used as underlying
layers. Delayed asynchronous processing is common. Awaiting replacement e.g.
using Registry and/or Preferences.
- Editor
-
General API operations are mostly thread-safe. Document operations generally
follow the Swing provisions: thread safe, with a read lock, and a
NetBeans-added write lock (which may be only an isolated, not atomic,
section—not clear). Other GUI operations require the event thread.
Document opening is complex and always runs asynchronously, even when called
synchronously. Actual editor implementation: unknown model, beyond what is
required by the API. Code completion queries and updates seem to run
asynchronously. Settings follow a very complex threading model.
- Output Window
-
Apparently thread-safe; internal display logic all in EQ.
- Ant (incl. compilation, execution, etc.)
-
Thread-safe for the outside caller; almost completely insulated.
- Project infrastructure
-
Uses a single mutex.
- (Java) source parsing and recreation
-
Needs investigation. Has some sort of read/write lock and complex rules for avoiding the event queue.
Structural models of other file types (properties files, XML, etc.) probably use fine-grained locking,
performing parses off the event thread. Unknown what model is used for event firing, etc.
- General platform infrastructure ("core")
-
Uses a variety of different models according to what layers it is
interacting with. Sometimes uses fine-grained locking or asynchronous
processing. Warm-up tasks started during application startup run
asynchronously.
So how much is threading used in NetBeans, anyway?
A simple Perl script was run over all Java sources in the NetBeans trunk as
of Jul 07 2003. It produces a list of all packages, giving for each the number
of lines of code it contains (skipping comments, blank lines, package
declarations, and imports) and how many of those lines explicitly use
threading-related idioms (including common classes and methods in the Java
platform as well as NetBeans threading utilities). The summary also shows
packages which use a lot of threading idioms, in both absolute and
proportional terms. Casual inspection of matching lines confirms that most of
the matches are really related to thread semantics.
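The counting described above can be sketched roughly as follows. This is an illustration only: the original survey used a Perl script whose patterns are not given here, so the class name `ThreadingIdiomCount`, the idiom list in `IDIOM`, and the crude comment handling are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class ThreadingIdiomCount {
    // Hypothetical idiom list; the patterns used by the original script are unknown.
    static final Pattern IDIOM = Pattern.compile(
            "\\bsynchronized\\b|\\bThread\\b|\\bRunnable\\b|\\.wait\\(|\\.notify(All)?\\("
            + "|\\bRequestProcessor\\b|\\bMutex\\b|\\binvoke(Later|AndWait)\\b|\\bwaitFinished\\b");

    /** Returns {codeLines, threadingLines} for the lines of one source file. */
    static int[] count(List<String> lines) {
        int code = 0;
        int threading = 0;
        boolean inBlockComment = false;
        for (String raw : lines) {
            String line = raw.trim();
            if (inBlockComment) {
                // Crude: anything after the closing marker on this line is ignored.
                if (line.contains("*/")) {
                    inBlockComment = false;
                }
                continue;
            }
            if (line.isEmpty() || line.startsWith("//")) {
                continue; // blank line or line comment
            }
            if (line.startsWith("/*")) {
                inBlockComment = !line.contains("*/");
                continue;
            }
            if (line.startsWith("package ") || line.startsWith("import ")) {
                continue; // declarations are not counted as code
            }
            code++;
            if (IDIOM.matcher(line).find()) {
                threading++;
            }
        }
        return new int[] {code, threading};
    }

    public static void main(String[] args) {
        int[] result = count(Arrays.asList(
                "package demo;",
                "class A {",
                "    synchronized void f() {}",
                "}"));
        System.out.println(result[1] + " of " + result[0] + " code lines use threading idioms");
    }
}
```

A per-package summary would sum these pairs over all files in a package and report the ratio of the second number to the first.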
The actual list of threading-related lines for
org.openide.text.CloneableEditorSupport is given as an example.
This class uses threading heavily: synchronization on several different
monitors; asynchronous tasks posted to RequestProcessor,
including waiting for them to finish, getting notification of them finishing,
and exposing them to outside code; and tasks posted to the event thread.
(According to its CVS history, this class has been patched at least 14 times
to solve threading-related problems.)
Subsystems that appear to use threading heavily, based on the above summary,
include many parts of the NetBeans core implementation; the Datasystems API;
the VCS Core module; the Debugger module; parts of the Java and XML modules;
etc. Generally, packages using threading-related idioms in over 1% of code
lines can be considered to be serious users of such semantics; this may not
seem like a lot, but remember that such method calls require a lot more mental
analysis than regular sequential code and can have large effects on the
runtime behavior of other pieces of code.
(The Debugger module seems to show up so heavily at least in part because it
appears to have copied and pasted RequestProcessor, a complex
threading utility class from the Open APIs, and added private patches.)
NetBeans' bug tracking system records over 600 threading-related bugs, such as
deadlocks and race conditions, some of which have not yet been fixed, or
were not found to be reproducible.
Overuse of threads, even in a language like Java that takes care of many
details, is dangerous and should be avoided whenever reasonable:
-
Threading semantics in Java is specific to the operating system, which can
increase the testing burden and lead to platform-specific bugs.
-
Fine-grained object locking code written by any but the most experienced
programmers tends to be wrong in one way or another. It is a difficult thing
to get right, and hard to tell by looking at the source code if it is right
or not—especially when the threading semantics spans class or package
boundaries. When multithreading is truly necessary, it is generally less
trouble-prone to use coarse-grained locking (a whole subsystem is grabbed,
operated on, then released) or to use message-passing style.
-
Code which does not run synchronously is far harder to write unit tests for,
so such tests are often skipped. Bugs in such code are often not
reproducible by the developer easily or at all, yet occur in the field.
Threading-related bugs range from recoverable minor errors, to possibly
corrupt internal data, to complete deadlocks ("grey screen of death") which
can cause data loss.
-
While splitting a job into several parallel threads can make it run
faster on machines with several processors (still somewhat uncommon for the
client computers that NetBeans runs on), this is only true up to a point,
and all multithreading (but especially fine-grained locking)
introduces some inherent overhead which can slow down a program. Generally
primitive locks (monitors) are fast in modern VMs, but thread context
switches can be much slower. NetBeans may suffer from such inefficiencies,
though it is difficult to measure this since profilers cannot readily
pinpoint it.
-
Much of the threading behavior in NetBeans was never documented, much less
formally modelled or specified. Plug-in modules attempting to use NetBeans
APIs must deal with threading at various points, yet the restrictions are
poorly documented and understood. Most module developers (in Sun and
elsewhere) are anyway not sufficiently experienced with Java threads to use
them safely in production code. Even NetBeans core platform developers tend
to rely on a mixture of knowledge and experimentation when making
substantial code changes affecting thread usage.
-
Swing is single-threaded and cannot be used from threads other than the AWT
event dispatch thread (henceforth just "AWT thread"). NetBeans code running
in arbitrary threads must therefore sometimes be "replanned" to the AWT
thread. Unfortunately, Swing does not enforce this directly; if you forget,
things will usually seem to work, but occasionally a mysterious error will
appear, and sometimes there are various disturbances in the keyboard focus
(harming accessibility).
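As an illustration of the last point, "replanning" work to the AWT thread, and checking for the mistake that Swing itself does not catch, might look like the sketch below. The class `EventThreadUtil` and its two methods are hypothetical names invented for this example, not NetBeans APIs; only the `SwingUtilities` calls are standard.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import javax.swing.SwingUtilities;

final class EventThreadUtil {
    /** Runs the task in the AWT thread: at once if already there, otherwise posted to it. */
    static void runInEventThread(Runnable task) {
        if (SwingUtilities.isEventDispatchThread()) {
            task.run();
        } else {
            SwingUtilities.invokeLater(task);
        }
    }

    /** Fails fast where Swing would silently tolerate a call from the wrong thread. */
    static void requireEventThread() {
        if (!SwingUtilities.isEventDispatchThread()) {
            throw new IllegalStateException("Must be called from the AWT thread, not "
                    + Thread.currentThread().getName());
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicBoolean ranInEventThread = new AtomicBoolean();
        // Called from the main thread, so the task is posted rather than run directly.
        runInEventThread(() -> ranInEventThread.set(SwingUtilities.isEventDispatchThread()));
        // The event queue is FIFO: once this no-op has run, the task above has run too.
        SwingUtilities.invokeAndWait(() -> { });
        System.out.println("ran in AWT thread: " + ranInEventThread.get());
    }
}
```

Calling something like `requireEventThread()` at the top of GUI-only methods turns the "mysterious error" into an immediate, reproducible failure, at the cost of one cheap check per call.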
It seems that a significant proportion of NetBeans' instability issues and
code complexity can be traced to the lack of a simple threading model.
Of course, threads are not present in the language, or used in NetBeans, for
no reason. There are circumstances where they are useful or necessary.
- Network latency
-
Any IDE operations which use the network—for example, parsing an XML file
that might refer to external entities or schemas not available in a local
catalog or entity resolver—can be arbitrarily slow, depending on network
congestion and availability, and must not block the event thread or other
critical work threads. Ideally such operations should also be cancellable
from the GUI.
(At least in some cases, e.g. XML parsing, it may be possible to first
attempt the parse from the event thread, using an entity resolver
that fails when passed a network URL. If the parse succeeds, fine; if not,
asynchronously retrieve the network resource(s), cache them temporarily, and
retry the parse.)
Access to plain files (using java.io.File) can however be
assumed to be fast. Although some operating systems do permit files to be
mounted from local drives which might suffer network latency (typically
small), such delays are likely to freeze most applications on the
user's computer, even many command shells; NetBeans is not a special case.
Users wishing to avoid such delays, if they are really a problem, should use
local disks, or network disks that support timeouts, local caches, etc.
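The "parse locally first, fall back to asynchronous retrieval" idea mentioned above for XML could be sketched as follows. This is a minimal illustration using the standard JAXP and SAX interfaces; the names `LocalFirstParse`, `LocalOnlyResolver`, `Result`, and `tryLocalParse` are hypothetical, and the list of URL schemes treated as "network" is an assumption.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class LocalFirstParse {
    /** Refuses network entities and remembers which ones were requested. */
    static final class LocalOnlyResolver implements EntityResolver {
        final List<String> refused = new ArrayList<>();

        @Override
        public InputSource resolveEntity(String publicId, String systemId) throws SAXException {
            // Assumption: these schemes are the ones that imply network access.
            if (systemId != null && systemId.matches("(?i)(https?|ftp):.*")) {
                refused.add(systemId);
                throw new SAXException("Refusing network entity: " + systemId);
            }
            return null; // let the parser open local resources as usual
        }
    }

    /** Outcome of the first attempt: a document, or the entities still to be fetched. */
    static final class Result {
        final Document document;    // null if the parse was abandoned
        final List<String> toFetch; // network system IDs that were refused

        Result(Document document, List<String> toFetch) {
            this.document = document;
            this.toFetch = toFetch;
        }
    }

    /** Safe to call from the event thread: never touches the network. */
    static Result tryLocalParse(String xml) throws Exception {
        LocalOnlyResolver resolver = new LocalOnlyResolver();
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver(resolver);
        try {
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return new Result(doc, resolver.refused);
        } catch (Exception e) {
            if (resolver.refused.isEmpty()) {
                throw e; // a genuine parse error, unrelated to network entities
            }
            return new Result(null, resolver.refused);
        }
    }

    public static void main(String[] args) throws Exception {
        Result r = tryLocalParse("<root/>");
        System.out.println("parsed locally: " + (r.document != null));
    }
}
```

When `document` is null, the caller would retrieve the entries in `toFetch` asynchronously, cache them, and retry the parse with a resolver that serves the cached copies; that second stage is not shown here.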
- Long-running tasks
-
Some tasks started by the IDE are naturally lengthy in their operation. For
example, running a compilation (including using an Ant script) could take
many seconds, minutes, or even hours to complete. It is unacceptable for the
application to be unresponsive during this period. Such tasks should
therefore be run off the event thread; typically they are given their own
dedicated thread or thread group. Long-running tasks should always be
cancellable.
NetBeans already has mechanisms to handle this kind of task, for example the
compilation and execution engines, which provide independent thread groups;
process listing and status; and cancellability. Tasks may or may not need to
interact with other NetBeans internals running in more normal threading
modes (for example, in the event thread). Ant scripts for example have only
minimal interaction with the rest of NetBeans—sending log messages back to
the output window to be displayed. Such interactions must manually
synchronize with the rest of NetBeans as appropriate.
Control of external (forked) processes typically falls into this category as
well. For example, the JPDA debugger needs to send messages to the debugged
process and wait for replies. Since the debugged process might be on a
remote machine, or might be deadlocked or slow, the IDE should not assume
that replies will arrive promptly or at all; some explicit threading is
required here.
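A long-running task that is both kept off the event thread and cancellable might be structured as in the sketch below. It uses the standard `java.util.concurrent` classes as a stand-in for the NetBeans compilation and execution engines, which it does not attempt to reproduce; `CancellableTaskDemo` and `lengthyJob` are hypothetical names, and the sleep merely simulates work.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class CancellableTaskDemo {
    /** Simulated lengthy job that can be interrupted between (and during) steps. */
    static Callable<Integer> lengthyJob(AtomicInteger stepsDone, int totalSteps) {
        return () -> {
            for (int i = 0; i < totalSteps; i++) {
                Thread.sleep(10); // throws InterruptedException when the job is cancelled
                stepsDone.incrementAndGet();
            }
            return totalSteps;
        };
    }

    public static void main(String[] args) throws Exception {
        // Dedicated worker thread, so the caller (e.g. the event thread) stays responsive.
        ExecutorService worker = Executors.newSingleThreadExecutor();
        AtomicInteger steps = new AtomicInteger();
        Future<Integer> job = worker.submit(lengthyJob(steps, 1_000_000));

        Thread.sleep(100); // the user changes their mind...
        job.cancel(true);  // ...so interrupt the worker

        worker.shutdown();
        boolean stopped = worker.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("cancelled=" + job.isCancelled() + ", worker stopped=" + stopped);
    }
}
```

Cancellation here is cooperative: it works only because the job reaches an interruptible point regularly. A job that computes without blocking would need to poll `Thread.interrupted()` itself.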
- GUI responsiveness
-
A more controversial use of threading is to improve GUI responsiveness in
the application, i.e. to avoid small delays (usually measured in hundreds of
milliseconds) between a user action and its visible effect.
A special case of this kind of optimization is "warm-up" support, which
asynchronously initializes various subsystems just after the application has
been displayed and is more or less ready to use. Initialization comprises
class loading as well as other precalculations of data structures and
loading of configurations and caches.
While the goal of GUI responsiveness is certainly important, the trade-off
of using asynchronous threads to accomplish it merits analysis. If there is
a way to achieve and keep GUI responsiveness associated with a piece of
functionality without using special threads or nondeterminacy, then that is
to be preferred.
XXX
XXX
XXX
Here are some recommendations for what can be done for new code, and what can
be changed for existing code in the short term.
Rather than attempting to make every possible fix in one big branch, it is
planned to separate the work into logical chunks, developing each in turn and
merging to the trunk in a timely manner, so as to get practical feedback on
whether the general direction is working well. In addition, this increases the
likely number of fixes to make it into earlier promotions or NetBeans
releases.
Tentative schedule of work:
- Looks & Nodes in EQ: stalled
- CloneableEditorSupport cleanup: TBD
- Line and other Editor API cleanup: TBD
- Locking models & utilities for Registry, Filesystems (draft): TBD
- Locking model & utilities for Registry (& global lookup?) (impl): TBD
- Locking model & utilities for Filesystems (& FS-Ext) (impl): TBD
- Locking model & utilities for Java parsing, …: TBD; probably will be obsoleted by rewrites of Java infrastructure?
See a separate proposal.
Code should avoid calling Task.waitFinished() without careful
consideration of the consequences, in particular possible deadlock.
Specifically, you should never call this method if there is any chance that
you, the caller, might be holding some kind of lock which the task might also
try to acquire; deadlock will ensue if you do.
This deadlock has been observed in the thread demo when calling
CloneableEditorSupport.openDocument(), which blocks on a task spawned in a
separate (RP) thread: if the calling thread is holding a write lock on a
Phadhail, and the editor support impl tries to acquire a read lock in order
to obtain the input stream, a deadlock results. The current workaround in
PhadhailEditorSupport is to preload the phadhail contents synchronously under
a read lock, then serve up these contents from the other thread when
requested without trying to acquire a lock.
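The hazard and the workaround can be reproduced in miniature with standard library classes. The sketch below is not the NetBeans code: `ReentrantReadWriteLock`, `ExecutorService`, and `Future.get()` stand in for the Phadhail lock, the RequestProcessor thread, and `Task.waitFinished()` respectively, and `WaitUnderLockDemo` and its methods are hypothetical names. A timeout is used so that the demonstration reports the deadlock instead of hanging.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class WaitUnderLockDemo {
    static final ReentrantReadWriteLock LOCK = new ReentrantReadWriteLock();
    static String contents = "hello"; // guarded by LOCK

    /** Hazard: wait for a task while holding the write lock that the task needs. */
    static boolean waitWhileHoldingWriteLock(ExecutorService worker) throws Exception {
        LOCK.writeLock().lock();
        try {
            Callable<String> load = () -> {
                LOCK.readLock().lockInterruptibly(); // can never succeed while we wait
                try {
                    return contents;
                } finally {
                    LOCK.readLock().unlock();
                }
            };
            Future<String> task = worker.submit(load);
            try {
                task.get(200, TimeUnit.MILLISECONDS); // without the timeout: deadlock
                return true;
            } catch (TimeoutException e) {
                task.cancel(true); // interrupt the stuck worker so the demo can go on
                return false;
            }
        } finally {
            LOCK.writeLock().unlock();
        }
    }

    /** Workaround: read the data while holding the lock; the task uses only the copy. */
    static String preloadThenWait(ExecutorService worker) throws Exception {
        LOCK.writeLock().lock();
        try {
            final String snapshot = contents; // the write lock holder may also read
            Callable<String> load = () -> snapshot; // acquires no lock at all
            return worker.submit(load).get(5, TimeUnit.SECONDS);
        } finally {
            LOCK.writeLock().unlock();
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            System.out.println("naive wait completed: " + waitWhileHoldingWriteLock(worker));
            System.out.println("preloaded contents: " + preloadThenWait(worker));
        } finally {
            worker.shutdownNow();
        }
    }
}
```

The workaround trades a little extra memory (the preloaded copy) for the guarantee that the waiting thread and the task never contend for the same lock.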
It would be better for CES.oD() not to spawn a separate thread to load the
document, since it will need to block anyway. Additionally, it is necessary
to handle the case that a nonblocking call to CES.getDocument() was made
independently and that thread is already running. In that case CES.oD() must
not block on completion of that task unless the task is entirely wrapped in a
read lock; otherwise the task could begin, a writer could then block on the
document, and the task could then request the read lock.
Architecture questions for modules should be tightened to require declarations
of their threading models: if they use any models (internally or exposed in an
API) which are not already declared by some API, this must be described. That
way it will be possible to get an overview of threading usage throughout the
system by reading the proper architecture documents.
API questions version 1.23 includes this question:
online