Mantis Bugtracker
  

Viewing Issue Simple Details Jump to Notes ] View Advanced ] Issue History ] Print ]
ID Category Severity Reproducibility Date Submitted Last Update
0006576 [Squeak] Kernel major always 07-24-07 07:28 11-25-08 23:12
Reporter andreas View Status public  
Assigned To
Priority normal Resolution fixed  
Status closed   Product Version
Summary 0006576: Delay is not thread-safe
Description Delay is not thread-safe since currently the calling process updates the internal Delay structure itself. If the process gets terminated while doing this (e.g., adding/removing from SuspendedDelays) the whole system is left in an inconsistent state and breaks. Attached changes fix the problem.
Additional Information It is *very* hard to recreate this problem under controlled circumstances which is the reason why there are no tests accompanying the changes. I've not been able to create the problem on a single machine but we have seen it for months on our servers and with these changes the problem simply went away.
Attached Files  SafeDelay-2.cs [^] (9,159 bytes) 07-24-07 16:05
 DelayTweaks-ar.cs [^] (3,866 bytes) 08-31-07 03:37
 DelayStartup-ar.cs [^] (1,750 bytes) 10-06-07 01:21

- Relationships
related to 0006588closed  Broken Semaphore>>critical: leads to frozen processes in Delay 

- Notes
(0010928 - 837 - 895 - 895 - 895 - 895 - 895)
andreas
07-24-07 07:29

"Change Set: SafeDelay
Date: 23 July 2007
Author: Andreas Raab

This change set fixes a set of severe problems with concurrent use of Delay. Previously, many of the delay-internal structures were modified by the calling process which made it susceptible to being terminated in the middle of manipulating these structures and leave Delay (and consequently the entire system) in an inconsistent state.

This change set fixes this problem by moving *all* manipulation of Delay's internal structures out of the calling process. As a side-effect it also removes the requirement of Delays being limited to SmallInteger range; the new code has no limitation on the duration of a delay.

No tests are provided since outside of true asynchronous environments (networks) it is basically impossible to recreate the situation reliably."
 
(0010931 - 61 - 61 - 61 - 61 - 61 - 61)
andreas
07-24-07 16:05

Updated the change set to one including the class definition.
 
(0010936 - 162 - 218 - 218 - 218 - 218 - 218)
Keith_Hodges
07-25-07 17:17
edited on: 07-25-07 17:19

To install this fix use: Installer mantis ensureFix: '5676 make Delay thread safe'.

"fix begin"
Installer mantis bug: 6576 fix: 'SafeDelay-2.cs'.
"fix end"

 
(0011083 - 471 - 535 - 535 - 535 - 535 - 535)
andreas
08-31-07 03:38

I've added a set of tweaks that should go on top of the previous changes:

"Change Set: DelayTweaks-ar
Date: 30 August 2007
Author: Andreas Raab

Some more delay tweaks:
- Ensure proper unscheduling of delays which are terminated in the wait
- Only unschedule delays that are beingWaitedOn
- Bullet-proof handleTimerEvent so that it only polls the clock once (each poll is a potential cause for msecs clock overflow) and cope with an obscure race condition"
 
(0011091 - 80 - 86 - 86 - 86 - 86 - 86)
edgardec
09-04-07 11:13

This now is 7146DelayTweaks-ar.cs and was in updates for 3.10
Thanks Andreas !
 
(0011095 - 227 - 251 - 251 - 251 - 251 - 251)
edgardec
09-07-07 09:50

Reminder sent to: andreas

Andreas:
I sorry to tell i should remove the .cs from updates stream
I found I can't save the image and all freezes.
It's happening on PPC Mac OS X with John VM.,
If you have new files for me to test, I put again when works
 
(0011096 - 218 - 218 - 218 - 218 - 218 - 218)
andreas
09-07-07 21:34

Without some more detail there is nothing I can do. The only thing I can say is that we run this code in our production servers 24/7 and that we run in our production clients and that we run it under load and it works.
 
(0011100 - 234 - 252 - 252 - 252 - 252 - 252)
edgardec
09-08-07 10:12

Reminder sent to: andreas

I take the 7143 image, filed in the two changes sets and again I can't save the image.
This time I try on my old Windows 98 Pc whit SqueakVM-Win32-3.10.4.
I appreciate any leads why you have many prodution things and I can't save.
 
(0011104 - 341 - 341 - 341 - 341 - 341 - 341)
andreas
09-11-07 02:08

I finally got around to look at this problem and it turns out it's a weird problem with (as one would expect in Delay) all sorts of tricky implications. It worked for us only because all of our images are contaminated by artifacts of earlier delay problems. We are currently testing a fix for all the problems; I'll let you know how it goes.
 
(0011218 - 194 - 200 - 326 - 326 - 326 - 326)
GazzaGuru
09-28-07 15:14

I've also noticed that these changes bring back a resumption time bug. Try interrupting [true] whileTrue. Then, after saving the image, try again...
C.f. http://bugs.squeak.org/view.php?id=381 [^]
 
(0011230 - 458 - 458 - 458 - 458 - 458 - 458)
andreas
10-01-07 08:30

Just as an FYI, this still isn't resolved completely. The problem reported by Edgar can be fixed simply by changing system startup/shutdown (turns out that delay is shut down too early which causes a problem when Sensor is shut down) but it seems as if there is still another problem that leads to Delay's AccessProtect mutex being signaled multiple times. I still haven't tracked this one down - I have a suspicion and added a guard but it hasn't fired yet.
 
(0011235 - 207 - 249 - 249 - 249 - 249 - 249)
edgardec
10-05-07 10:07

Reminder sent to: andreas

Could send or put here the right startup order ?
I try to change doing

Smalltalk removeFromStartUpList: Delay.
Smalltalk addToStartUpList: Delay

And with Sensor, but don't work.
Too deep for me :-)
 
(0011236 - 1495 - 1929 - 1929 - 1929 - 1929 - 1929)
GazzaGuru
10-05-07 21:48

Just a test for "allegedly" correct operation..



s _ Semaphore forMutualExclusion.
i _ 1.

"Loop while there's not too many signals on the semaphore (should be
0 or 1)"
[[(s instVarNamed: #excessSignals) < 2] whileTrue: [

"Fork two processes to do things inside the semaphore"
p _ [[true] whileTrue: [s critical: [(Delay forMilliseconds: 10) wait]]]
    forkNamed: 'p', i printString.
    
q _ [[true] whileTrue: [s critical: [(Delay forMilliseconds: 10) wait]]]
    forkNamed: 'q', i printString.

"Increment the counter just to make it easy to ID the processes" i _ i + 1.

"Delay to give the processes a chance to resume and potentially get
into the critical on the Semaphore"
(Delay forMilliseconds: 500 atRandom) wait.
p terminate.
(Delay forMilliseconds: 500 atRandom) wait.
q terminate.

"After terminating the two processes, the excess signals should be 1
on the Semaphore because
nothing is in critical any more, assuming the processes terminated
properly and the unwind happened
as it should, but... "

"In our images with Andreas' recent Delay and Semaphore fixes, we
always get one of these two
error conditions (although normally too many signals). This would
cause deadlock in 'real world' situations"
(s instVarNamed: #excessSignals) = 0 ifTrue: [
    WorldState addDeferredUIMessage: [self inform: 'Too few signals
after ', i printString, ' loops'].
    self halt]].
self inform: 'Too many signals after ', i printString, ' loops'] fork
 
(0011238 - 98 - 110 - 110 - 229 - 229 - 229)
andreas
10-06-07 01:22
edited on: 10-06-07 01:22

DelayStartup-ar should bring order in the Delay startup/shutdown. Also note the changes to 0006588

 
(0011240 - 133 - 145 - 145 - 145 - 145 - 145)
edgardec
10-06-07 10:05

Thanks Andreas !
I put again 7146DelayTweaks-ar.cs and was in updates for 3.10.
Now I could save the image and continue the updates
 
(0011256 - 384 - 417 - 417 - 417 - 417 - 417)
edgardec
10-08-07 12:16

Seems this is causing the "go limbo" on both Mac and Windows XP when I try to run the 'all test" of SUnit.
At this moment I must delete from updates again until was solved.
I also try the related 6588, but still I could complete the all test run.
People wishing this must try the Installer way and not our updates way until Ralph and Andreas agree how to deal this important issue
 
(0012804 - 55 - 55 - 55 - 55 - 55 - 55)
KenCausey
11-25-08 23:12

Harvested as update 7144 and released with Squeak 3.10.
 

- Issue History
Date Modified Username Field Change
07-24-07 07:28 andreas New Issue
07-24-07 07:28 andreas File Added: SafeDelay.cs
07-24-07 07:29 andreas Note Added: 0010928
07-24-07 07:52 al Issue Monitored: al
07-24-07 16:04 andreas File Deleted: SafeDelay.cs
07-24-07 16:05 andreas Note Added: 0010931
07-24-07 16:05 andreas File Added: SafeDelay-2.cs
07-25-07 17:17 Keith_Hodges Note Added: 0010936
07-25-07 17:19 Keith_Hodges Note Edited: 0010936
08-31-07 03:37 andreas File Added: DelayTweaks-ar.cs
08-31-07 03:38 andreas Note Added: 0011083
09-02-07 22:57 mikevdg Issue Monitored: mikevdg
09-04-07 11:13 edgardec Note Added: 0011091
09-07-07 09:50 edgardec Note Added: 0011095
09-07-07 21:34 andreas Note Added: 0011096
09-08-07 10:12 edgardec Note Added: 0011100
09-11-07 02:08 andreas Note Added: 0011104
09-28-07 15:14 GazzaGuru Note Added: 0011218
10-01-07 08:30 andreas Note Added: 0011230
10-05-07 10:07 edgardec Note Added: 0011235
10-05-07 21:48 GazzaGuru Note Added: 0011236
10-05-07 22:13 matthewf Relationship added related to 0006588
10-06-07 01:21 andreas File Added: DelayStartup-ar.cs
10-06-07 01:22 andreas Note Added: 0011238
10-06-07 01:22 andreas Note Edited: 0011238
10-06-07 10:05 edgardec Note Added: 0011240
10-08-07 12:16 edgardec Note Added: 0011256
12-18-07 15:30 lewis Issue Monitored: lewis
11-25-08 23:12 KenCausey Status new => closed
11-25-08 23:12 KenCausey Note Added: 0012804
11-25-08 23:12 KenCausey Resolution open => fixed
11-25-08 23:12 KenCausey Fixed in Version  => 3.10


Mantis 1.0.8[^]
Copyright © 2000 - 2007 Mantis Group
133 total queries executed.
71 unique queries executed.
Powered by Mantis Bugtracker