Mantis - Squeak
Viewing Issue Advanced Details
6576 Kernel major always 07-24-07 07:28 11-25-08 23:12
none 3.10  
0006576: Delay is not thread-safe
Delay is not thread-safe since currently the calling process updates the internal Delay structure itself. If the process gets terminated while doing this (e.g., adding/removing from SuspendedDelays) the whole system is left in an inconsistent state and breaks. Attached changes fix the problem.
It is *very* hard to recreate this problem under controlled circumstances which is the reason why there are no tests accompanying the changes. I've not been able to create the problem on a single machine but we have seen it for months on our servers and with these changes the problem simply went away.
related to 0006588closed  Broken Semaphore>>critical: leads to frozen processes in Delay 
 SafeDelay-2.cs [^] (9,159 bytes) 07-24-07 16:05
 DelayTweaks-ar.cs [^] (3,866 bytes) 08-31-07 03:37
 DelayStartup-ar.cs [^] (1,750 bytes) 10-06-07 01:21

07-24-07 07:29   
"Change Set: SafeDelay
Date: 23 July 2007
Author: Andreas Raab

This change set fixes a set of severe problems with concurrent use of Delay. Previously, many of the delay-internal structures were modified by the calling process which made it susceptible to being terminated in the middle of manipulating these structures and leave Delay (and consequently the entire system) in an inconsistent state.

This change set fixes this problem by moving *all* manipulation of Delay's internal structures out of the calling process. As a side-effect it also removes the requirement of Delays being limited to SmallInteger range; the new code has no limitation on the duration of a delay.

No tests are provided since outside of true asynchronous environments (networks) it is basically impossible to recreate the situation reliably."
07-24-07 16:05   
Updated the change set to one including the class definition.
07-25-07 17:17   
To install this fix use: Installer mantis ensureFix: '5676 make Delay thread safe'.

"fix begin"
Installer mantis bug: 6576 fix: 'SafeDelay-2.cs'.
"fix end"

08-31-07 03:38   
I've added a set of tweaks that should go on top of the previous changes:

"Change Set: DelayTweaks-ar
Date: 30 August 2007
Author: Andreas Raab

Some more delay tweaks:
- Ensure proper unscheduling of delays which are terminated in the wait
- Only unschedule delays that are beingWaitedOn
- Bullet-proof handleTimerEvent so that it only polls the clock once (each poll is a potential cause for msecs clock overflow) and cope with an obscure race condition"
09-04-07 11:13   
This now is 7146DelayTweaks-ar.cs and was in updates for 3.10
Thanks Andreas !
09-07-07 09:50   
Reminder sent to: andreas

I sorry to tell i should remove the .cs from updates stream
I found I can't save the image and all freezes.
It's happening on PPC Mac OS X with John VM.,
If you have new files for me to test, I put again when works
09-07-07 21:34   
Without some more detail there is nothing I can do. The only thing I can say is that we run this code in our production servers 24/7 and that we run in our production clients and that we run it under load and it works.
09-08-07 10:12   
Reminder sent to: andreas

I take the 7143 image, filed in the two changes sets and again I can't save the image.
This time I try on my old Windows 98 Pc whit SqueakVM-Win32-3.10.4.
I appreciate any leads why you have many prodution things and I can't save.
09-11-07 02:08   
I finally got around to look at this problem and it turns out it's a weird problem with (as one would expect in Delay) all sorts of tricky implications. It worked for us only because all of our images are contaminated by artifacts of earlier delay problems. We are currently testing a fix for all the problems; I'll let you know how it goes.
09-28-07 15:14   
I've also noticed that these changes bring back a resumption time bug. Try interrupting [true] whileTrue. Then, after saving the image, try again...
C.f. [^]
10-01-07 08:30   
Just as an FYI, this still isn't resolved completely. The problem reported by Edgar can be fixed simply by changing system startup/shutdown (turns out that delay is shut down too early which causes a problem when Sensor is shut down) but it seems as if there is still another problem that leads to Delay's AccessProtect mutex being signaled multiple times. I still haven't tracked this one down - I have a suspicion and added a guard but it hasn't fired yet.
10-05-07 10:07   
Reminder sent to: andreas

Could send or put here the right startup order ?
I try to change doing

Smalltalk removeFromStartUpList: Delay.
Smalltalk addToStartUpList: Delay

And with Sensor, but don't work.
Too deep for me :-)
10-05-07 21:48   
Just a test for "allegedly" correct operation..

s _ Semaphore forMutualExclusion.
i _ 1.

"Loop while there's not too many signals on the semaphore (should be
0 or 1)"
[[(s instVarNamed: #excessSignals) < 2] whileTrue: [

"Fork two processes to do things inside the semaphore"
p _ [[true] whileTrue: [s critical: [(Delay forMilliseconds: 10) wait]]]
    forkNamed: 'p', i printString.
q _ [[true] whileTrue: [s critical: [(Delay forMilliseconds: 10) wait]]]
    forkNamed: 'q', i printString.

"Increment the counter just to make it easy to ID the processes" i _ i + 1.

"Delay to give the processes a chance to resume and potentially get
into the critical on the Semaphore"
(Delay forMilliseconds: 500 atRandom) wait.
p terminate.
(Delay forMilliseconds: 500 atRandom) wait.
q terminate.

"After terminating the two processes, the excess signals should be 1
on the Semaphore because
nothing is in critical any more, assuming the processes terminated
properly and the unwind happened
as it should, but... "

"In our images with Andreas' recent Delay and Semaphore fixes, we
always get one of these two
error conditions (although normally too many signals). This would
cause deadlock in 'real world' situations"
(s instVarNamed: #excessSignals) = 0 ifTrue: [
    WorldState addDeferredUIMessage: [self inform: 'Too few signals
after ', i printString, ' loops'].
    self halt]].
self inform: 'Too many signals after ', i printString, ' loops'] fork
10-06-07 01:22   
DelayStartup-ar should bring order in the Delay startup/shutdown. Also note the changes to 0006588

10-06-07 10:05   
Thanks Andreas !
I put again 7146DelayTweaks-ar.cs and was in updates for 3.10.
Now I could save the image and continue the updates
10-08-07 12:16   
Seems this is causing the "go limbo" on both Mac and Windows XP when I try to run the 'all test" of SUnit.
At this moment I must delete from updates again until was solved.
I also try the related 6588, but still I could complete the all test run.
People wishing this must try the Installer way and not our updates way until Ralph and Andreas agree how to deal this important issue
11-25-08 23:12   
Harvested as update 7144 and released with Squeak 3.10.