Mantis - Squeak
Viewing Issue Advanced Details
6576 Kernel major always 07-24-07 07:28 11-25-08 23:12
andreas  
 
normal  
closed  
fixed  
none    
none 3.10  
0006576: Delay is not thread-safe
Delay is not thread-safe since currently the calling process updates the internal Delay structure itself. If the process gets terminated while doing this (e.g., adding/removing from SuspendedDelays) the whole system is left in an inconsistent state and breaks. Attached changes fix the problem.
It is *very* hard to recreate this problem under controlled circumstances which is the reason why there are no tests accompanying the changes. I've not been able to create the problem on a single machine but we have seen it for months on our servers and with these changes the problem simply went away.
related to 0006588closed  Broken Semaphore>>critical: leads to frozen processes in Delay 
 SafeDelay-2.cs [^] (9,159 bytes) 07-24-07 16:05
 DelayTweaks-ar.cs [^] (3,866 bytes) 08-31-07 03:37
 DelayStartup-ar.cs [^] (1,750 bytes) 10-06-07 01:21

Notes
(0010928)
andreas   
07-24-07 07:29   
"Change Set: SafeDelay
Date: 23 July 2007
Author: Andreas Raab

This change set fixes a set of severe problems with concurrent use of Delay. Previously, many of the delay-internal structures were modified by the calling process which made it susceptible to being terminated in the middle of manipulating these structures and leave Delay (and consequently the entire system) in an inconsistent state.

This change set fixes this problem by moving *all* manipulation of Delay's internal structures out of the calling process. As a side-effect it also removes the requirement of Delays being limited to SmallInteger range; the new code has no limitation on the duration of a delay.

No tests are provided since outside of true asynchronous environments (networks) it is basically impossible to recreate the situation reliably."
(0010931)
andreas   
07-24-07 16:05   
Updated the change set to one including the class definition.
(0010936)
Keith_Hodges   
07-25-07 17:17   
To install this fix use: Installer mantis ensureFix: '5676 make Delay thread safe'.

"fix begin"
Installer mantis bug: 6576 fix: 'SafeDelay-2.cs'.
"fix end"

(0011083)
andreas   
08-31-07 03:38   
I've added a set of tweaks that should go on top of the previous changes:

"Change Set: DelayTweaks-ar
Date: 30 August 2007
Author: Andreas Raab

Some more delay tweaks:
- Ensure proper unscheduling of delays which are terminated in the wait
- Only unschedule delays that are beingWaitedOn
- Bullet-proof handleTimerEvent so that it only polls the clock once (each poll is a potential cause for msecs clock overflow) and cope with an obscure race condition"
(0011091)
edgardec   
09-04-07 11:13   
This now is 7146DelayTweaks-ar.cs and was in updates for 3.10
Thanks Andreas !
(0011095)
edgardec   
09-07-07 09:50   
Reminder sent to: andreas

Andreas:
I sorry to tell i should remove the .cs from updates stream
I found I can't save the image and all freezes.
It's happening on PPC Mac OS X with John VM.,
If you have new files for me to test, I put again when works
(0011096)
andreas   
09-07-07 21:34   
Without some more detail there is nothing I can do. The only thing I can say is that we run this code in our production servers 24/7 and that we run in our production clients and that we run it under load and it works.
(0011100)
edgardec   
09-08-07 10:12   
Reminder sent to: andreas

I take the 7143 image, filed in the two changes sets and again I can't save the image.
This time I try on my old Windows 98 Pc whit SqueakVM-Win32-3.10.4.
I appreciate any leads why you have many prodution things and I can't save.
(0011104)
andreas   
09-11-07 02:08   
I finally got around to look at this problem and it turns out it's a weird problem with (as one would expect in Delay) all sorts of tricky implications. It worked for us only because all of our images are contaminated by artifacts of earlier delay problems. We are currently testing a fix for all the problems; I'll let you know how it goes.
(0011218)
GazzaGuru   
09-28-07 15:14   
I've also noticed that these changes bring back a resumption time bug. Try interrupting [true] whileTrue. Then, after saving the image, try again...
C.f. http://bugs.squeak.org/view.php?id=381 [^]
(0011230)
andreas   
10-01-07 08:30   
Just as an FYI, this still isn't resolved completely. The problem reported by Edgar can be fixed simply by changing system startup/shutdown (turns out that delay is shut down too early which causes a problem when Sensor is shut down) but it seems as if there is still another problem that leads to Delay's AccessProtect mutex being signaled multiple times. I still haven't tracked this one down - I have a suspicion and added a guard but it hasn't fired yet.
(0011235)
edgardec   
10-05-07 10:07   
Reminder sent to: andreas

Could send or put here the right startup order ?
I try to change doing

Smalltalk removeFromStartUpList: Delay.
Smalltalk addToStartUpList: Delay

And with Sensor, but don't work.
Too deep for me :-)
(0011236)
GazzaGuru   
10-05-07 21:48   
Just a test for "allegedly" correct operation..



s _ Semaphore forMutualExclusion.
i _ 1.

"Loop while there's not too many signals on the semaphore (should be
0 or 1)"
[[(s instVarNamed: #excessSignals) < 2] whileTrue: [

"Fork two processes to do things inside the semaphore"
p _ [[true] whileTrue: [s critical: [(Delay forMilliseconds: 10) wait]]]
    forkNamed: 'p', i printString.
    
q _ [[true] whileTrue: [s critical: [(Delay forMilliseconds: 10) wait]]]
    forkNamed: 'q', i printString.

"Increment the counter just to make it easy to ID the processes" i _ i + 1.

"Delay to give the processes a chance to resume and potentially get
into the critical on the Semaphore"
(Delay forMilliseconds: 500 atRandom) wait.
p terminate.
(Delay forMilliseconds: 500 atRandom) wait.
q terminate.

"After terminating the two processes, the excess signals should be 1
on the Semaphore because
nothing is in critical any more, assuming the processes terminated
properly and the unwind happened
as it should, but... "

"In our images with Andreas' recent Delay and Semaphore fixes, we
always get one of these two
error conditions (although normally too many signals). This would
cause deadlock in 'real world' situations"
(s instVarNamed: #excessSignals) = 0 ifTrue: [
    WorldState addDeferredUIMessage: [self inform: 'Too few signals
after ', i printString, ' loops'].
    self halt]].
self inform: 'Too many signals after ', i printString, ' loops'] fork
(0011238)
andreas   
10-06-07 01:22   
DelayStartup-ar should bring order in the Delay startup/shutdown. Also note the changes to 0006588

(0011240)
edgardec   
10-06-07 10:05   
Thanks Andreas !
I put again 7146DelayTweaks-ar.cs and was in updates for 3.10.
Now I could save the image and continue the updates
(0011256)
edgardec   
10-08-07 12:16   
Seems this is causing the "go limbo" on both Mac and Windows XP when I try to run the 'all test" of SUnit.
At this moment I must delete from updates again until was solved.
I also try the related 6588, but still I could complete the all test run.
People wishing this must try the Installer way and not our updates way until Ralph and Andreas agree how to deal this important issue
(0012804)
KenCausey   
11-25-08 23:12   
Harvested as update 7144 and released with Squeak 3.10.