Mantis Bugtracker
  

Viewing Issue Simple Details Jump to Notes ] View Advanced ] Issue History ] Print ]
ID Category Severity Reproducibility Date Submitted Last Update
0001375 [Squeak] Collections crash always 06-25-05 21:37 10-03-09 19:34
Reporter lexspoon View Status public  
Assigned To andreas
Priority normal Resolution fixed  
Status assigned   Product Version 3.8
Summary 0001375: new SharedQueue based on Monitor
Description Attached is a simple implementation of SharedQueue that is based on a Monitor instead of on Semaphore's.

It has an emphasis on simplicity of implementation even though it possibly loses performance. In particular, it uses an OrderedCollection instead of managing its own array of events with read and write pointers. This approach seems appropriate given the long history of synchronization problems with this class in Squeak--let's nail this problem once and for all!

The class is a subclass of Stream, so that convenient methods such as next: and nextPutAll: are available. To be honest, however, I have not tracked down that all Stream methods make sense; there are likely some extra methods that should be implemented to make this class the most convenient Stream class that it can be.

This new implementation deprecates the "flush" methods in favor of methods named "remove", because "flush" has an existing meaning in Squeak that does not mean "remove". This is especially important since the class is now a subclass of Stream.

Three basic test cases are included, including a case that exhibits the problem John McIntosh has found with Squeak's current SharedQueue.

I have tested this implementation in EventSensor, and so far the image is working fine.

It also survives the following crash test that John McIntosh posted to squeak-dev:

bad _ [[Sensor flushNonKbdEvents] repeat] forkAt: Processor
userBackgroundPriority.
bad2 _ [[Sensor eventQueue peek] repeat] forkAt: Processor
userBackgroundPriority.

(When you are satisfied that your image has not crashed, do "bad terminate. bad2 terminate" to kill these processes.)

The attached code names the class simply SharedQueue2 and puts it in category SharedQueue2, but this is not a recommendation! I have done it this way to make it easy to test with, but I suspect that what we really want to do is swap in this new class in place of SharedQueue.

To test this class in EventSensor, you should do all-references (alt-N) of SharedQueue, replace the two references that are in class EventSensor, and then do "EventSensor install".
Additional Information
Attached Files  SharedQueue2.st [^] (4,682 bytes) 06-25-05 21:37
 M1375-SharedQueue2Enable.1.cs [^] (2,149 bytes) 12-17-08 03:28

- Relationships
related to 0002714new  SharedQueue2 should (maybe?) replace SharedQueue 

- Notes
(0002632 - 248 - 248 - 248 - 248 - 248 - 248)
laza
09-10-05 12:00

A recent 3.9 image survives both crash tests (on Windows). Was the problem fixed elsewhere? Should the new implementation still be considered for inclusion because of better design? Is there a need to have it in 3.8 as an update to prevent crashes?
 
(0002633 - 34 - 34 - 34 - 34 - 34 - 34)
laza
09-10-05 12:02

Reminder sent to: MarcusDenker

Marcus, any plans for this in 3.9?
 
(0002635 - 6683 - 7817 - 8012 - 8277 - 8277 - 8277)
lexspoon
09-10-05 19:09

I still see the problem in 3.9 update 6690, which is the newest version on ftp.squeak.org. "load updates" doesn't do anything in this image -- it reports 0 new updates. The virtual machine is the unix 3.8a1 machine.

Further, in addition to the EventSensor problem, there are problems in SharedQueue itself.

IMHO, SharedQueue ought to be a rock-solid building block that Squeak provides to programmers. Currently, it is something that programmers might want to avoid due to its bugs. This disconnect is what led me to work on this issue even though my Squeak-hacking time is extraordinarily limited. Basic things should work.

Thus the same strategies seem to be indicated:
- Transition to a Monitors-based SharedQueue, starting by including the attached fileout into standard images.
- Use the new SharedQueue in EventSensor.
- In general, use Monitor instead of Semaphore in the future, now that Nathanael has contributed it.


The detailed reasoning is in the email that is appended below.

from http://macos.tuwien.ac.at:9009/1033662992.asHtml.full [^] :

Date: Sat, 25 Jun 2005 14:50:50 -0400
From: "Lex Spoon" <lex@cc.gatech.edu>
Subject: Re: SharedQueue issues


> I believe the problem here is that flushAllSuchThat: is now waiting
> on readSynch, but the signal it needs
> is waiting on accessProtect (pending nextPut:) before doing the
> readSynch signal.
>
> Classic deadlock situation.

What a bear. It required an extra cup of coffee to figure out how
SharedQueue is even trying to work, much less how to fix it. This
message has my analysis of the situation, but the punchline is very
simple: let's replace SharedQueue with a new class as described in issue
0001375. Semaphore is just too hard to use correctly.

The problem John's detective work has found is that flushAllSuchThat: is
grabbing accessProtect, but other methods such as #next remain free to
decrement readSynch even though they don't hold accessProtect. This
leads to deadlocks.

That's the short of it. I have written up a longer description of the
current implementation's strategy down below. It might make a good
addition to the class comment for SharedQueue, if we decide to keep the
current implementation around. One thing that gave me pause, by the
way, is that the "flush" methods do not do what I think of as "flushing"
-- instead, they are "removing" items. "Flush" normally means to push
data downstream, not to destroy data completely. It may be better to
name these methods with "remove" instead of "flush".

The problem John has found applies more widely than just
flushAllSuchThat:--it also impacts nextOrNilSuchThat: and nextOrNil.
I'll call these methods "the Bad 3". All of the Bad 3 share the habit
of checking readPointer vs. writePointer for available items *before*
waiting on readSynch. This is bad because some other process may have
done the right thing and waited first on readSynch. Thus, one of the
items that the Bad 3 *think* is available based on readPointer vs.
writePointer, may have already been reserved by another process by
waiting on readSynch.

This observation leads to a simple exhibition of the problem that I have
appended below. It uses only next, nextPut:, and nextOrNil. It takes
advantage of the fact that nextOrNil does not decrease readSynch at all
unless it clears it completely to 0; this can be used to cause readSynch to
have more signals on it than there are actual items in the queue.

Here's a solution strategy based on the above analysis. Whenever one
of the Bad 3 wants to consider removing an element, it should first try
to decrease readSynch. If the attempt succeeds, then it has full rights
to the next item in the queue and can procede. If the attempt fails,
then there are no more unreserved items in the queue and the method
should stop its work. After doing their normal work, the Bad 3 methods
should all increase readSynch by the number if items they reserved but
did not actually remove, so that other callers may access those items.


Unfortunately, class Semaphore does not sufficiently support "try to
wait, but it's okay if there aren't any signals available". It has
waitTimeoutSeconds:, which presumably you could use with a timeout of 0,
but that is not quite enough because you cannot tell whether it timed out
or ate a signal. This functionality is key to the solution strategy I
proposed, and so I am giving up on this solution for now.

Given all this analysis, I see three possible ways forward:

        1. Make a new SharedQueue based on Nathanael's Monitor implementation.

        2. Improve class Semaphore as described above, so that the
all-semaphores approach can be made to work.

        3. Come up with another solution strategy that can use the current
Semaphore class. For example, it may work to have all the reading
methods (the Bad 3 and also #next) poll whenever they successfully wait
on readSynch but then discover that no items are available.

My inclination is to try #1..... And 30 minutes later, I have done so.
Check out Mantis bug 0001375. I'll post this message anyway for the
record, in case someone wants to perservere with 0000002 for some reason.


-Lex



CURRENT IMPLEMENTATION STRATEGY

SharedQueue works like this:

contentsArray: holds the actual items. nil entries are unused (and so,
nil cannot be sent across a SharedQueue). the buffer is NOT circular.

readPointer: the pointer into contentsArray to start reading the next
element. nil's are skipped.

writePointer: the pointer where the next element should be placed.

readSynch: a semaphore holding, roughly, a count of the number of items
available to read. To be precise, it holds a number less than or equal
to the number of available items; if it holds a lower number, then the
delta by which it is lower is the number of calls to #next (and similar
methods) that have registered to remove an item but have not yet removed
it.

accessProtect: a semaphore used for mutual exclusion. it must be held
by code using or modifying contentsArray, readPointer, or writePointer.



==== EXAMPLE CODE ====
"exhibits a problem where nextOrNil does not properly decrease readSynch"

q := SharedQueue new.
q nextPut: 1.
q nextPut: 2. "readSynch has 2 signals now"

q nextOrNil. "readSynch *still* has 2 signals"
[ r1 _ q next ] fork.
[ r2 _ q next ] fork. "the second #next will think there is an item available. BOOM."
(Delay forSeconds: 1) wait.

"the below code finishes the exercise, but the code has already crashed by this point..."
q nextPut: 3.
{r1.r2}
 
(0002733 - 84 - 84 - 84 - 84 - 84 - 84)
MarcusDenker
09-29-05 18:09

SharedQueue2 needs the be added to 3.9, then we test it there and migrate over later
 
(0002800 - 26 - 26 - 26 - 26 - 26 - 26)
MarcusDenker
10-07-05 18:09

in 3.9 for further testing
 
(0007459 - 155 - 155 - 155 - 155 - 155 - 155)
lexspoon
09-28-06 10:47

It has been a year, now, and no problems have come to light. We should start migrating to this. All it requires is replacing SharedQueue by SharedQueue2.
 
(0012853 - 103 - 109 - 109 - 109 - 109 - 109)
Keith_Hodges
12-17-08 03:15

So whats the status, now? It has been a while?
We need some code to perform the switch over as a task.
 
(0012854 - 137 - 181 - 181 - 181 - 181 - 181)
Keith_Hodges
12-17-08 03:30

"fix begin"
Installer mantis bug: 1375 fix: 'M1375-SharedQueue2Enable.1.cs'.
WorldState initialize.
EventSensor initialize.
"fix end"
 

- Issue History
Date Modified Username Field Change
06-25-05 21:37 lexspoon New Issue
06-25-05 21:37 lexspoon File Added: SharedQueue2.st
09-10-05 12:00 laza Note Added: 0002632
09-10-05 12:00 laza Status new => feedback
09-10-05 12:02 laza Issue Monitored: MarcusDenker
09-10-05 12:02 laza Note Added: 0002633
09-10-05 19:09 lexspoon Note Added: 0002635
09-29-05 18:09 MarcusDenker Status feedback => resolved
09-29-05 18:09 MarcusDenker Resolution open => fixed
09-29-05 18:09 MarcusDenker Assigned To  => MarcusDenker
09-29-05 18:09 MarcusDenker Note Added: 0002733
09-29-05 18:10 MarcusDenker Assigned To MarcusDenker =>
10-07-05 18:09 MarcusDenker Status resolved => closed
10-07-05 18:09 MarcusDenker Note Added: 0002800
10-07-05 18:10 MarcusDenker Fixed in Version  => 3.9
09-28-06 10:47 lexspoon Status closed => feedback
09-28-06 10:47 lexspoon Resolution fixed => reopened
09-28-06 10:47 lexspoon Note Added: 0007459
12-17-08 03:15 Keith_Hodges Note Added: 0012853
12-17-08 03:28 Keith_Hodges File Added: M1375-SharedQueue2Enable.1.cs
12-17-08 03:30 Keith_Hodges Note Added: 0012854
12-17-08 03:34 Keith_Hodges Status feedback => acknowledged
01-09-09 23:47 Keith_Hodges Status acknowledged => pending
01-09-09 23:51 Keith_Hodges Status pending => testing
01-09-09 23:57 Keith_Hodges Relationship added related to 0002714
01-10-09 03:39 Keith_Hodges Status testing => resolved
01-10-09 03:39 Keith_Hodges Fixed in Version 3.9 => 3.11
01-10-09 03:39 Keith_Hodges Resolution reopened => fixed
01-10-09 03:39 Keith_Hodges Assigned To  => Keith_Hodges
01-10-09 03:41 Keith_Hodges Status resolved => testing
10-03-09 19:34 Keith_Hodges Status testing => assigned
10-03-09 19:34 Keith_Hodges Assigned To Keith_Hodges => andreas


Mantis 1.0.8[^]
Copyright © 2000 - 2007 Mantis Group
95 total queries executed.
52 unique queries executed.
Powered by Mantis Bugtracker