I still see the problem in 3.9 update 6690, which is the newest version on ftp.squeak.org. "load updates" doesn't do anything in this image -- it reports 0 new updates. The virtual machine is the unix 3.8a1 machine.
Further, in addition to the EventSensor problem, there are problems in SharedQueue itself.
IMHO, SharedQueue ought to be a rock-solid building block that Squeak provides to programmers. Currently, it is something that programmers might want to avoid due to its bugs. This disconnect is what led me to work on this issue even though my Squeak-hacking time is extraordinarily limited. Basic things should work.
Thus the same strategies seem to be indicated:
- Transition to a Monitors-based SharedQueue, starting by including the attached fileout into standard images.
- Use the new SharedQueue in EventSensor.
- In general, use Monitor instead of Semaphore in the future, now that Nathanael has contributed it.
The detailed reasoning is in the email that is appended below.
from http://macos.tuwien.ac.at:9009/1033662992.asHtml.full [^] :
Date: Sat, 25 Jun 2005 14:50:50 -0400
From: "Lex Spoon" <email@example.com>
Subject: Re: SharedQueue issues
> I believe the problem here is that flushAllSuchThat: is now waiting
> on readSynch, but the signal it needs
> is waiting on accessProtect (pending nextPut:) before doing the
> readSynch signal.
> Classic deadlock situation.
What a bear. It required an extra cup of coffee to figure out how
SharedQueue is even trying to work, much less how to fix it. This
message has my analysis of the situation, but the punchline is very
simple: let's replace SharedQueue with a new class as described in issue
0001375. Semaphore is just too hard to use correctly.
The problem John's detective work has found is that flushAllSuchThat: is
grabbing accessProtect, but other methods such as #next remain free to
decrement readSynch even though they don't hold accessProtect. This
leads to deadlocks.
That's the short of it. I have written up a longer description of the
current implementation's strategy down below. It might make a good
addition to the class comment for SharedQueue, if we decide to keep the
current implementation around. One thing that gave me pause, by the
way, is that the "flush" methods do not do what I think of as "flushing"
-- instead, they are "removing" items. "Flush" normally means to push
data downstream, not to destroy data completely. It may be better to
name these methods with "remove" instead of "flush".
The problem John has found applies more widely than just
flushAllSuchThat:--it also impacts nextOrNilSuchThat: and nextOrNil.
I'll call these methods "the Bad 3". All of the Bad 3 share the habit
of checking readPointer vs. writePointer for available items *before*
waiting on readSynch. This is bad because some other process may have
done the right thing and waited first on readSynch. Thus, one of the
items that the Bad 3 *think* is available based on readPointer vs.
writePointer, may have already been reserved by another process by
waiting on readSynch.
This observation leads to a simple exhibition of the problem that I have
appended below. It uses only next, nextPut:, and nextOrNil. It takes
advantage of the fact that nextOrNil does not decrease readSynch at all
unless it clears it completely to 0; this can be used to cause readSynch to
have more signals on it than there are actual items in the queue.
Here's a solution strategy based on the above analysis. Whenever one
of the Bad 3 wants to consider removing an element, it should first try
to decrease readSynch. If the attempt succeeds, then it has full rights
to the next item in the queue and can procede. If the attempt fails,
then there are no more unreserved items in the queue and the method
should stop its work. After doing their normal work, the Bad 3 methods
should all increase readSynch by the number if items they reserved but
did not actually remove, so that other callers may access those items.
Unfortunately, class Semaphore does not sufficiently support "try to
wait, but it's okay if there aren't any signals available". It has
waitTimeoutSeconds:, which presumably you could use with a timeout of 0,
but that is not quite enough because you cannot tell whether it timed out
or ate a signal. This functionality is key to the solution strategy I
proposed, and so I am giving up on this solution for now.
Given all this analysis, I see three possible ways forward:
1. Make a new SharedQueue based on Nathanael's Monitor implementation.
2. Improve class Semaphore as described above, so that the
all-semaphores approach can be made to work.
3. Come up with another solution strategy that can use the current
Semaphore class. For example, it may work to have all the reading
methods (the Bad 3 and also #next) poll whenever they successfully wait
on readSynch but then discover that no items are available.
My inclination is to try #1..... And 30 minutes later, I have done so.
Check out Mantis bug 0001375. I'll post this message anyway for the
record, in case someone wants to perservere with 0000002 for some reason.
CURRENT IMPLEMENTATION STRATEGY
SharedQueue works like this:
contentsArray: holds the actual items. nil entries are unused (and so,
nil cannot be sent across a SharedQueue). the buffer is NOT circular.
readPointer: the pointer into contentsArray to start reading the next
element. nil's are skipped.
writePointer: the pointer where the next element should be placed.
readSynch: a semaphore holding, roughly, a count of the number of items
available to read. To be precise, it holds a number less than or equal
to the number of available items; if it holds a lower number, then the
delta by which it is lower is the number of calls to #next (and similar
methods) that have registered to remove an item but have not yet removed
accessProtect: a semaphore used for mutual exclusion. it must be held
by code using or modifying contentsArray, readPointer, or writePointer.
==== EXAMPLE CODE ====
"exhibits a problem where nextOrNil does not properly decrease readSynch"
q := SharedQueue new.
q nextPut: 1.
q nextPut: 2. "readSynch has 2 signals now"
q nextOrNil. "readSynch *still* has 2 signals"
[ r1 _ q next ] fork.
[ r2 _ q next ] fork. "the second #next will think there is an item available. BOOM."
(Delay forSeconds: 1) wait.
"the below code finishes the exercise, but the code has already crashed by this point..."
q nextPut: 3.