Anonymous | Login | 12-07-2019 23:42 UTC |
Main | My View | View Issues | Change Log | Docs |
Viewing Issue Simple Details [ Jump to Notes ] | [ View Advanced ] [ Issue History ] [ Print ] | |||||||||||
ID | Category | Severity | Reproducibility | Date Submitted | Last Update | |||||||
0001375 | [Squeak] Collections | crash | always | 06-25-05 21:37 | 10-03-09 19:34 | |||||||
Reporter | lexspoon | View Status | public | |||||||||
Assigned To | andreas | |||||||||||
Priority | normal | Resolution | fixed | |||||||||
Status | assigned | Product Version | 3.8 | |||||||||
Summary | 0001375: new SharedQueue based on Monitor | |||||||||||
Description |
Attached is a simple implementation of SharedQueue that is based on a Monitor instead of on Semaphore's. It has an emphasis on simplicity of implementation even though it possibly loses performance. In particular, it uses an OrderedCollection instead of managing its own array of events with read and write pointers. This approach seems appropriate given the long history of synchronization problems with this class in Squeak--let's nail this problem once and for all! The class is a subclass of Stream, so that convenient methods such as next: and nextPutAll: are available. To be honest, however, I have not tracked down that all Stream methods make sense; there are likely some extra methods that should be implemented to make this class the most convenient Stream class that it can be. This new implementation deprecates the "flush" methods in favor of methods named "remove", because "flush" has an existing meaning in Squeak that does not mean "remove". This is especially important since the class is now a subclass of Stream. Three basic test cases are included, including a case that exhibits the problem John McIntosh has found with Squeak's current SharedQueue. I have tested this implementation in EventSensor, and so far the image is working fine. It also survives the following crash test that John McIntosh posted to squeak-dev: bad _ [[Sensor flushNonKbdEvents] repeat] forkAt: Processor userBackgroundPriority. bad2 _ [[Sensor eventQueue peek] repeat] forkAt: Processor userBackgroundPriority. (When you are satisfied that your image has not crashed, do "bad terminate. bad2 terminate" to kill these processes.) The attached code names the class simply SharedQueue2 and puts it in category SharedQueue2, but this is not a recommendation! I have done it this way to make it easy to test with, but I suspect that what we really want to do is swap in this new class in place of SharedQueue. To test this class in EventSensor, you should do all-references (alt-N) of SharedQueue, replace the two references that are in class EventSensor, and then do "EventSensor install". |
|||||||||||
Additional Information | ||||||||||||
Attached Files |
![]() ![]() |
|||||||||||
|
![]() |
|
(0002632 - 248 - 248 - 248 - 248 - 248 - 248) laza 09-10-05 12:00 |
A recent 3.9 image survives both crash tests (on Windows). Was the problem fixed elsewhere? Should the new implementation still be considered for inclusion because of better design? Is there a need to have it in 3.8 as an update to prevent crashes? |
(0002633 - 34 - 34 - 34 - 34 - 34 - 34) laza 09-10-05 12:02 |
Reminder sent to: MarcusDenker Marcus, any plans for this in 3.9? |
(0002635 - 6683 - 7817 - 8012 - 8277 - 8277 - 8277) lexspoon 09-10-05 19:09 |
I still see the problem in 3.9 update 6690, which is the newest version on ftp.squeak.org. "load updates" doesn't do anything in this image -- it reports 0 new updates. The virtual machine is the unix 3.8a1 machine. Further, in addition to the EventSensor problem, there are problems in SharedQueue itself. IMHO, SharedQueue ought to be a rock-solid building block that Squeak provides to programmers. Currently, it is something that programmers might want to avoid due to its bugs. This disconnect is what led me to work on this issue even though my Squeak-hacking time is extraordinarily limited. Basic things should work. Thus the same strategies seem to be indicated: - Transition to a Monitors-based SharedQueue, starting by including the attached fileout into standard images. - Use the new SharedQueue in EventSensor. - In general, use Monitor instead of Semaphore in the future, now that Nathanael has contributed it. The detailed reasoning is in the email that is appended below. from http://macos.tuwien.ac.at:9009/1033662992.asHtml.full [^] : Date: Sat, 25 Jun 2005 14:50:50 -0400 From: "Lex Spoon" <lex@cc.gatech.edu> Subject: Re: SharedQueue issues > I believe the problem here is that flushAllSuchThat: is now waiting > on readSynch, but the signal it needs > is waiting on accessProtect (pending nextPut:) before doing the > readSynch signal. > > Classic deadlock situation. What a bear. It required an extra cup of coffee to figure out how SharedQueue is even trying to work, much less how to fix it. This message has my analysis of the situation, but the punchline is very simple: let's replace SharedQueue with a new class as described in issue 0001375. Semaphore is just too hard to use correctly. The problem John's detective work has found is that flushAllSuchThat: is grabbing accessProtect, but other methods such as #next remain free to decrement readSynch even though they don't hold accessProtect. This leads to deadlocks. That's the short of it. I have written up a longer description of the current implementation's strategy down below. It might make a good addition to the class comment for SharedQueue, if we decide to keep the current implementation around. One thing that gave me pause, by the way, is that the "flush" methods do not do what I think of as "flushing" -- instead, they are "removing" items. "Flush" normally means to push data downstream, not to destroy data completely. It may be better to name these methods with "remove" instead of "flush". The problem John has found applies more widely than just flushAllSuchThat:--it also impacts nextOrNilSuchThat: and nextOrNil. I'll call these methods "the Bad 3". All of the Bad 3 share the habit of checking readPointer vs. writePointer for available items *before* waiting on readSynch. This is bad because some other process may have done the right thing and waited first on readSynch. Thus, one of the items that the Bad 3 *think* is available based on readPointer vs. writePointer, may have already been reserved by another process by waiting on readSynch. This observation leads to a simple exhibition of the problem that I have appended below. It uses only next, nextPut:, and nextOrNil. It takes advantage of the fact that nextOrNil does not decrease readSynch at all unless it clears it completely to 0; this can be used to cause readSynch to have more signals on it than there are actual items in the queue. Here's a solution strategy based on the above analysis. Whenever one of the Bad 3 wants to consider removing an element, it should first try to decrease readSynch. If the attempt succeeds, then it has full rights to the next item in the queue and can procede. If the attempt fails, then there are no more unreserved items in the queue and the method should stop its work. After doing their normal work, the Bad 3 methods should all increase readSynch by the number if items they reserved but did not actually remove, so that other callers may access those items. Unfortunately, class Semaphore does not sufficiently support "try to wait, but it's okay if there aren't any signals available". It has waitTimeoutSeconds:, which presumably you could use with a timeout of 0, but that is not quite enough because you cannot tell whether it timed out or ate a signal. This functionality is key to the solution strategy I proposed, and so I am giving up on this solution for now. Given all this analysis, I see three possible ways forward: 1. Make a new SharedQueue based on Nathanael's Monitor implementation. 2. Improve class Semaphore as described above, so that the all-semaphores approach can be made to work. 3. Come up with another solution strategy that can use the current Semaphore class. For example, it may work to have all the reading methods (the Bad 3 and also #next) poll whenever they successfully wait on readSynch but then discover that no items are available. My inclination is to try #1..... And 30 minutes later, I have done so. Check out Mantis bug 0001375. I'll post this message anyway for the record, in case someone wants to perservere with 0000002 for some reason. -Lex CURRENT IMPLEMENTATION STRATEGY SharedQueue works like this: contentsArray: holds the actual items. nil entries are unused (and so, nil cannot be sent across a SharedQueue). the buffer is NOT circular. readPointer: the pointer into contentsArray to start reading the next element. nil's are skipped. writePointer: the pointer where the next element should be placed. readSynch: a semaphore holding, roughly, a count of the number of items available to read. To be precise, it holds a number less than or equal to the number of available items; if it holds a lower number, then the delta by which it is lower is the number of calls to #next (and similar methods) that have registered to remove an item but have not yet removed it. accessProtect: a semaphore used for mutual exclusion. it must be held by code using or modifying contentsArray, readPointer, or writePointer. ==== EXAMPLE CODE ==== "exhibits a problem where nextOrNil does not properly decrease readSynch" q := SharedQueue new. q nextPut: 1. q nextPut: 2. "readSynch has 2 signals now" q nextOrNil. "readSynch *still* has 2 signals" [ r1 _ q next ] fork. [ r2 _ q next ] fork. "the second #next will think there is an item available. BOOM." (Delay forSeconds: 1) wait. "the below code finishes the exercise, but the code has already crashed by this point..." q nextPut: 3. {r1.r2} |
(0002733 - 84 - 84 - 84 - 84 - 84 - 84) MarcusDenker 09-29-05 18:09 |
SharedQueue2 needs the be added to 3.9, then we test it there and migrate over later |
(0002800 - 26 - 26 - 26 - 26 - 26 - 26) MarcusDenker 10-07-05 18:09 |
in 3.9 for further testing |
(0007459 - 155 - 155 - 155 - 155 - 155 - 155) lexspoon 09-28-06 10:47 |
It has been a year, now, and no problems have come to light. We should start migrating to this. All it requires is replacing SharedQueue by SharedQueue2. |
(0012853 - 103 - 109 - 109 - 109 - 109 - 109) Keith_Hodges 12-17-08 03:15 |
So whats the status, now? It has been a while? We need some code to perform the switch over as a task. |
(0012854 - 137 - 181 - 181 - 181 - 181 - 181) Keith_Hodges 12-17-08 03:30 |
"fix begin" Installer mantis bug: 1375 fix: 'M1375-SharedQueue2Enable.1.cs'. WorldState initialize. EventSensor initialize. "fix end" |
Mantis 1.0.8[^]
Copyright © 2000 - 2007 Mantis Group
95 total queries executed. 52 unique queries executed. |