Anonymous | Login | 01-20-2021 09:34 UTC |
Main | My View | View Issues | Change Log | Docs |
Viewing Issue Simple Details [ Jump to Notes ] | [ View Advanced ] [ Issue History ] [ Print ] | ||||||||
ID | Category | Severity | Reproducibility | Date Submitted | Last Update | ||||
0000854 | [Squeak] VM | crash | random | 01-30-05 22:13 | 10-02-05 18:37 | ||||
Reporter | lewis | View Status | public | ||||||
Assigned To | |||||||||
Priority | normal | Resolution | fixed | ||||||
Status | closed | Product Version | 3.9 | ||||||
Summary | 0000854: Long Delay schedules a deferred image crash | ||||||||
Description |
If you schedule a Delay with a duration long enough to be a LargeInteger, it will sit happily in the SuspendedDelays queue until such time as it becomes the next Delay to be activated. At that time, #primitiveSignalAtMilliseconds is called. The primitive is only able to handle SmallIntegers, so it fails, whereupon #primSignal:atMilliseconds: attempts to throw 'a primitive has failed', which bolluxes up the works in the process scheduler. As far as I can tell, this is an unrecoverable error condition. The following illustrates the problem and will usually crash your image: | bigDelay | bigDelay := Delay forSeconds: SmallInteger maxVal + 1. [bigDelay wait. Transcript show: 'I have been waiting a long time for this'; cr] fork. (Delay forSeconds: 1) wait. Smalltalk garbageCollect "=> image will hang after failed primitive" Two things need to be addressed: 1) The primitive should accept large integer arguments. Note that the longest possible delay with a small integer is only about 12 days: Duration seconds: (SmallInteger maxVal / 1000) asInteger => 12:10:15:41 2) Delay class>>primSignal:atMilliseconds: should do something other than #primitiveFailed when the primitive fails. Perhaps writing something to the Transcript would make sense, or just ignore the failure entirely? |
||||||||
Additional Information | This bug exists in all versions of Squeak. | ||||||||
Attached Files |
![]() ![]() ![]() ![]() ![]() |
||||||||
|
![]() |
||||||||||||||||
SYSTEM WARNING: Creating default object from empty value SYSTEM WARNING: Creating default object from empty value
|
![]() |
|
(0001113 - 551 - 575 - 575 - 575 - 575 - 575) lexspoon 01-30-05 22:57 |
Asking the primitive to support long integers seems like a bit much to me. The better solution is make a LongDelay class which can handle huge delays. There would have to be a background process periodically checking for LongDelay's that are close to coming into effect, and schedule a regular Delay if so. That's a lot of work. How about, for now, we have class Delay simply refuse to create long-waiting instances? It's a shame not to support long-running delays, but this would seem much better than the current situation. Patch attached. |
(0001114 - 649 - 673 - 673 - 673 - 673 - 673) lewis 01-30-05 23:12 |
Correction: My suggestion of having the primitive accept LargeInteger objects is a bad idea. The VM is keeping track of milliseconds in an int variable, so at best we could increase the maximum duration of a Delay by a factor of 4, which does not fix anything really. (Lex - funny, I was just thinking about the idea of LongDelay as soon as I realized that using LargeIntegers was a dumb idea. Maybe it's not so hard to do, it could just be something that holds on to a collection of regular Delays until they all expire.) Having class Delay refuse to create instances with LargeInteger durations does sound like the right thing to do for now. |
(0001115 - 251 - 269 - 269 - 269 - 269 - 269) laza 01-30-05 23:28 |
Lex, your fix won't catch too long delay times in the instance creation method forSeconds: in Delay. In my variation of your fix I removed some redundant code and placed the bounds check on the instance side. Hopefully this catches all cases. Alex |
(0001117 - 508 - 520 - 520 - 520 - 520 - 520) tim 01-31-05 02:14 |
We should probably check for a LPI after a prim fail and do the appropriate modulo arithmetic to find out if the delay should fire in the next cycle round the ~12days of milliseconds. Consider using an actual Date as a way to specify the firing time of a really long delay; it would at least make it very explicit that this is a Delay of immense proportions. Debugging would be a bit less hair-tearing if you could see at a glance that the Delay is meant to go off in a thousand years and not yesterday. |
(0001118 - 828 - 987 - 987 - 987 - 987 - 987) lewis 01-31-05 12:44 |
Oops, the proposed bounds check fixes will not work. The #primitiveSignalAtMilliseconds is called with the millisecond clock time at which we want the Delay to end. This means that a Delay of SmallInteger>>maxVal milliseconds will call the primitive with a LPI argument, which will trigger the system hang problem. Note that this is also going to be platform dependent. If there are any platforms that do not roll the millisecond clock every 24 hours, they will behave differently with respect to the longest Delay that can be represented. "CRASHME" | bigDelay | bigDelay := Delay forMilliseconds: SmallInteger maxVal. [bigDelay wait. Transcript show: 'I have been waiting a long time for this'; cr] fork. (Delay forSeconds: 1) wait. Smalltalk garbageCollect "=> image will hang after failed primitive" |
(0001119 - 710 - 914 - 914 - 914 - 914 - 914) laza 01-31-05 18:43 |
primitiveMillisecondClock "Return the value of the millisecond clock as an integer. Note that the millisecond clock wraps around periodically. On some platforms it can wrap daily. The range is limited to SmallInteger maxVal / 2 to allow delays of up to that length without overflowing a SmallInteger." Ok, so in V2 I lowered the upper range check down to SmallInteger maxVal // 2. This seems to work: "DOES NOT CRASH" | bigDelay | bigDelay := Delay forMilliseconds: SmallInteger maxVal // 2 - 1. [bigDelay wait. Transcript show: 'I have been waiting a long time for this'; cr] fork. (Delay forSeconds: 1) wait. Smalltalk garbageCollect "=> image will hang after failed primitive" |
(0001120 - 1068 - 1298 - 1298 - 1392 - 1392 - 1392) lewis 02-01-05 06:18 |
Posted on squeak-dev, with attachment uploaded here: Subject: [BUG][FIX] Long Delay crashes Morphic image If an attempt to schedule a Delay fails, cancel the Delay and schedule and error notification (Morphic only). This corrects a problem in which an invalid Delay results in a locked image when running in Morphic. An invalid Delay can be created by specifying a delay duration that overflows the size of a SmallInteger. For example, the following will crash a Morphic image if this change set has not been applied: | bigDelay | bigDelay := Delay forMilliseconds: SmallInteger maxVal + 1. bigDelay inspect. [bigDelay wait. Transcript show: 'I have been waiting a long time for this'; cr] fork. (Delay forSeconds: 1) wait. Smalltalk garbageCollect This change set corrects the system crash problem reported as 0000854 on Mantis (Subject: [BUG][VM] long Delay schedules a deferred image crash). It does not resolve the remaining problem that Delays of duration longer than about 12 days cannot be scheduled on Squeak. Dave |
(0001121 - 1211 - 1409 - 1409 - 1409 - 1409 - 1409) laza 02-01-05 09:20 |
Dave, I see some problems with your fix. 1) Error reporting is defered until the Delay is scheduled as the active delay | bigDelay | someDelay := Delay forMilliseconds: 24*60*60*1000. bigDelay := Delay forMilliseconds: SmallInteger maxVal + 1. [someDelay wait. Transcript show: 'Waited some time'; cr] fork. [bigDelay wait. Transcript show: 'I have been waiting a long time for this'; cr] fork. With your fix the error will be reported earliest after a day, when someDelay gets unscheduled. 2) It makes the behaviour of Delay non deterministic, because delay times t > (SmallInteger maxVal // 2) will sometimes work and sometimes not, depending on the MillisecondClock value c and t+c<=Smalltalk maxVal I think class Delay should be seen as to be used only for rather short delays, because it is based on a clock with millisecond precision. For anything that should be delayed days or weeks I think we should have some LongDelay class, wich could be based on a coarser clock and cooperates with Delay. Now to be on the safe side of things, should we lower the upper bound for Delays even more down to 24h, because the Millisecond clock on some platforms wraps every 24h? |
(0001122 - 526 - 560 - 560 - 560 - 560 - 560) lewis 02-01-05 12:39 |
Right, DelayCrashFix-dtl is only intended to address the "image hangs" part of the problem. It does not address the problem of large Delays not working properly, and you are right that this is non-deterministic behavior. IMO it is not acceptable to have a failure mode that hangs the system, so both problems should be fixed. With respect to the millisecond clock question, I'm not exactly sure how this works on the various platorms and VMs. Your choice of SmallInteger//2 is probably safe, but I'm really not certain. |
(0001123 - 103 - 103 - 103 - 103 - 103 - 103) laza 02-01-05 13:28 |
I'm not sure either, so I asked for some additional advice from some of the authors of the Delay class. |
(0001127 - 429 - 488 - 488 - 488 - 488 - 488) laza 02-02-05 16:58 |
via email: Hi - Short version: Lex' solution is by far the best. Making the primitive LargeInt aware might be impossible; but one could easily think of having a LongDelay using (short) delays to re-schedule itself when appropriate (in fact, it would be doable to build this into the timer interrupt watcher itself). In the absence of that fix, raising an error is by far preferrable to a VM crash. Cheers, - Andreas |
(0001128 - 329 - 354 - 354 - 354 - 354 - 354) laza 02-02-05 17:06 |
> Right, DelayCrashFix-dtl is only intended to address the "image hangs" part of the problem. Ok, so I build a final changeset (V3) which combines the two approaches (bound check and crash prevention) and limited the longest possible delay time to 24h. Anything above that should be handled by the upcoming LongDelay class! :) |
(0001130 - 1053 - 1098 - 1098 - 1098 - 1098 - 1098) tim 02-03-05 00:00 |
Guys, I think you're getting a bit off track here. There is no need for any extra delay class; a Delay just needs a number and an LPI is fine. Floats and Fraction (and Complex and Tensors etc) are probably a poor idea for now. LPIs sort perfectly well in the delay list. No problems there. What we need is to stop trying to schedule any Delays with non-SmallInt resumption times. Notice that we already have code to re-work the resumtption times when the clock rolls over (see Delay class>timerInterruptWatcher) so long delays will eventually work their way down to 'sensible' values. The problem is in #activate IF the lowest value resumption time is >SmallInteger. At this point the thing to do is to create a substitute Delay with some longish delay and a new semaphore (for it to signal) All that will do is put off the evil moment for a while but during that time other Delays may be made and life should continue as normal. If the long delay gets to the front of the queue again and is still >SMallInteger then te same thing gets done. |
(0001132 - 824 - 848 - 848 - 848 - 848 - 848) lewis 02-03-05 01:56 |
I posted an implementation of LongDelay on the list, and have attached the change sets here also. This seems to do a fairly good job of implementing long delays. I tried to fail-safe it a bit by giving the delay watcher process a nice name so folks will not be tempted to terminate it in the ProcessBrowser. Tim may have a better idea there, need to think about it. But I'm not sure how to do it without adding a Process to watch the delayTheInevitable semaphore. Once you have added a process to wait on the inevitable, I think you have about the same thing that I just did with LongDelay. Notes: Class Monitor is subclassed from Delay, hence any image crash bug does not go away when LongDelay is added. The DelayCrashFix-dtl patch is still required as a safety net, regardless of how we end up handling long delays. |
(0001143 - 1125 - 1179 - 1179 - 1179 - 1179 - 1179) laza 02-12-05 12:51 |
(back from carnival :) Tim, I think we should fix the bug by guarding the primitive against wrong parameters and leave the Delay class as it is. For the majority of use cases the max delay time of about six days should be more than enough. Besides Dave's recent demand for extra long delays, I don't think I have seen any complains in the last seven years about the max Delay being too short. Extending the Delay class to support extra long delays by using substitute delays dosn't sound too good to me. Such a change would also mean a lot of testing (work) to get Delay rock solid and I don't think this is worth the benefits. (And I don't want to be the one who gets tared and feathered if the extended Delay later turns out to break in any way :) I would say the few that *really* need extra long delays can just easily use Dave's changeset (or use a solution of their own). That said, I provide just another fix, which isn't that restrictive. I've tested that even if I use a artificial crippled millisecondClock that wraps about every 8 seconds, delay times from 0 to (SmallInt // 2) do indeed work. *Phew* |
(0001144 - 101 - 101 - 101 - 101 - 101 - 101) laza 02-12-05 12:56 |
I think the issue has been resolved. Just feel free to change the status again if you don't think so. |
(0001719 - 266 - 266 - 266 - 266 - 266 - 266) KenCausey 07-08-05 18:23 |
I'm not sure I would count this issue to be resolved as long as it has not been 'fixed' in any existing image or update. Perhaps this appeared in the now abandoned 3.9a branch? At any rate this does not appear to be fixed in 3.8-6665 and so I'm going to reopen it. |
(0002738 - 303 - 309 - 309 - 309 - 309 - 309) MarcusDenker 09-30-05 09:53 |
putting this on resolved is a good idea: We need to be able to tag those items that are resolved, but not yet harvested. This makes the life of the harvesters *much* easier, because they can focus on that stuff that can be looked at. So: if fixed by a changeset --> resolved. If in the image --> closed. |
(0002739 - 284 - 284 - 284 - 284 - 284 - 284) KenCausey 09-30-05 18:00 |
Yes, OK. I did that back before I'd really considered how we would apply the statuses for our purposes. I agree that I should have simply left it as Resolved and that is the correct status for a report that has an attached (or referenced as being in a public 'inbox') fix that works. |
(0002758 - 16 - 16 - 16 - 16 - 16 - 16) ducasse 10-02-05 18:36 |
in 3.9a script 5 |
Mantis 1.0.8[^]
Copyright © 2000 - 2007 Mantis Group
163 total queries executed. 79 unique queries executed. |