Mantis Bugtracker
  

Viewing Issue Simple Details Jump to Notes ] View Advanced ] Issue History ] Print ]
ID Category Severity Reproducibility Date Submitted Last Update
0003488 [Squeak] Collections minor always 04-21-06 03:06 01-06-07 02:32
Reporter black View Status public  
Assigned To
Priority normal Resolution open  
Status new   Product Version 3.9
Summary 0003488: Interval equality and hash are broken
Description Intervals should be equal when they contain the same elements, which can be true (in the case of an interval of size < 2) even if they don't have the same increment.

This changeset includes tests (that fail, to expose the bugs) and fixes.

Also, before this changeset, Intervals and Arrays could be equal, but would never have the same hash. The fix for this is to delete the old Interval >> #hash method and let both Interval and Array inherit the same #hash from SequenceableCollection. I also modified that #hash so that it is more reasonable for large collections.
Additional Information
Attached Files  IntervalBugFix.3.cs [^] (2,755 bytes) 04-21-06 03:06
 SequenceableCollection-hash-improved.1.cs [^] (910 bytes) 04-24-06 02:26
 Interval-equal-patch.1.cs [^] (1,051 bytes) 04-24-06 03:50

- Relationships
child of 0007002new  The Mother of all issues involving interval 

- Notes
(0004769 - 242 - 282 - 408 - 408 - 408 - 408)
nicolas cellier
04-22-06 00:55

OK, this is exactly the bug http://bugs.impara.de/view.php?id=3380 [^]

The bad side is that super hash is quite inefficient on large collections...
It will hash all and every element...
Try
  Time millisecondsToRun: [(1 to: 10000000) hash]
 
(0004771 - 17 - 17 - 17 - 17 - 17 - 17)
ducasse
04-22-06 13:29

So what do we do?
 
(0004779 - 1095 - 1173 - 1173 - 1173 - 1173 - 1173)
nicolas cellier
04-24-06 01:17
edited on: 04-24-06 02:13

Sorry, i did not notice that proposed change set already does that simple thing: limit the number of hashed elements.

The idea is very good. But the implementation is not that great. 5 first elements is not enough in my opinion.

Take the example of String and Symbol (i know they have hash redefined, but the example is speaking).
Imagine you apply such an algorithm: a lot of class and method names are beginning with the same 5 letters and have the same size (313 entries in Smalltalk over 2217 in my 3.9 image, and 11717 symbol over 38406)... Then a lot of entries with same hash code will clash in MethodDictionary and in SystemDictionary, making lookup in these dictionaries become closer to linear search...

I think you can extend this prefix property to other collections.
So what is a better algorithm ?
You need to at least hash the prefix, the suffix and some intermediate elements.
I do not know what is the right number of elements to hash.
I provide 64 as an arbitrary example, hash 16 elements at the beginning, 16 at the end of collection, and the rest sparsely...

 
(0004781 - 501 - 597 - 597 - 597 - 597 - 597)
nicolas cellier
04-24-06 03:50

Unfortunately, some tricky floating point arithmetic defeat Interval equality:

| i1 i2 eps |
eps := (1.0 timesTwoPower: -52).
i1 := 100 to: 102.1 by: (1+eps).
i2 := 100 to: 102.1 by: (1).
{i1 asArray = i2 asArray.
i1 asArray = i1.
i2 asArray = i2.
i1 asArray = i2.
i2 asArray = i1.
i1 = i2.}

you can see that last is false, equality is no more transitive...
This is cured if increment equality test is removed. Could not find a failure example in this case...

Improvement attached.
 
(0008633 - 657 - 693 - 693 - 693 - 693 - 693)
black
12-11-06 11:02

You have a legitimate concern that a bad hash will make dictionaries slow. Unfortunately, a _good_ hash will also make dictionaries slow, if it is expensive to compute.

The right trade-off is not easy to find. I think that the important case is with symbols. However, remember that the very reason that Symbols exist is so that they can use an efficient equality operation, and one would think, an efficeint hash.

However, at some time in the past, someone made strings and symbols containing the same characters equal. This means that their hashes must also be equal. Which means that Symbol hash is slow.

We may as well get rid of symbols ;-)
 
(0008837 - 283 - 307 - 307 - 307 - 307 - 307)
nicolas cellier
01-06-07 02:32

Very true, symbol hash cannot rely anymore on object ID...

But before throwing them, remember that they have another property:
if they were duplicated they would waste a lot of memory in your image !
(each message send in each method uses one, except maybe special selectors...)
 

- Issue History
Date Modified Username Field Change
04-21-06 03:06 black New Issue
04-21-06 03:06 black File Added: IntervalBugFix.3.cs
04-22-06 00:55 nicolas cellier Note Added: 0004769
04-22-06 13:29 ducasse Note Added: 0004771
04-24-06 01:17 nicolas cellier Note Added: 0004779
04-24-06 02:13 nicolas cellier Note Edited: 0004779
04-24-06 02:26 nicolas cellier File Added: SequenceableCollection-hash-improved.1.cs
04-24-06 03:50 nicolas cellier Note Added: 0004781
04-24-06 03:50 nicolas cellier File Added: Interval-equal-patch.1.cs
12-11-06 11:02 black Note Added: 0008633
01-06-07 02:32 nicolas cellier Note Added: 0008837
08-21-10 13:11 nicolas cellier Relationship added child of 0007002


Mantis 1.0.8[^]
Copyright © 2000 - 2007 Mantis Group
67 total queries executed.
40 unique queries executed.
Powered by Mantis Bugtracker