Mantis - Squeak
Viewing Issue Advanced Details
3488 Collections minor always 04-21-06 03:06 01-06-07 02:32
black  
 
normal  
new 3.9  
open  
none    
none  
0003488: Interval equality and hash are broken
Intervals should be equal when they contain the same elements, which can be true (in the case of an interval of size < 2) even if they don't have the same increment.

This changeset includes tests (that fail, to expose the bugs) and fixes.

Also, before this changeset, Intervals and Arrays could be equal, but would never have the same hash. The fix for this is to delete the old Interval >> #hash method and let both Interval and Array inherit the same #hash from SequenceableCollection. I also modified that #hash so that it is more reasonable for large collections.
child of 0007002new  The Mother of all issues involving interval 
 IntervalBugFix.3.cs [^] (2,755 bytes) 04-21-06 03:06
 SequenceableCollection-hash-improved.1.cs [^] (910 bytes) 04-24-06 02:26
 Interval-equal-patch.1.cs [^] (1,051 bytes) 04-24-06 03:50

Notes
(0004769)
nicolas cellier   
04-22-06 00:55   
OK, this is exactly the bug http://bugs.impara.de/view.php?id=3380 [^]

The bad side is that super hash is quite inefficient on large collections...
It will hash all and every element...
Try
  Time millisecondsToRun: [(1 to: 10000000) hash]
(0004771)
ducasse   
04-22-06 13:29   
So what do we do?
(0004779)
nicolas cellier   
04-24-06 01:17   
Sorry, i did not notice that proposed change set already does that simple thing: limit the number of hashed elements.

The idea is very good. But the implementation is not that great. 5 first elements is not enough in my opinion.

Take the example of String and Symbol (i know they have hash redefined, but the example is speaking).
Imagine you apply such an algorithm: a lot of class and method names are beginning with the same 5 letters and have the same size (313 entries in Smalltalk over 2217 in my 3.9 image, and 11717 symbol over 38406)... Then a lot of entries with same hash code will clash in MethodDictionary and in SystemDictionary, making lookup in these dictionaries become closer to linear search...

I think you can extend this prefix property to other collections.
So what is a better algorithm ?
You need to at least hash the prefix, the suffix and some intermediate elements.
I do not know what is the right number of elements to hash.
I provide 64 as an arbitrary example, hash 16 elements at the beginning, 16 at the end of collection, and the rest sparsely...

(0004781)
nicolas cellier   
04-24-06 03:50   
Unfortunately, some tricky floating point arithmetic defeat Interval equality:

| i1 i2 eps |
eps := (1.0 timesTwoPower: -52).
i1 := 100 to: 102.1 by: (1+eps).
i2 := 100 to: 102.1 by: (1).
{i1 asArray = i2 asArray.
i1 asArray = i1.
i2 asArray = i2.
i1 asArray = i2.
i2 asArray = i1.
i1 = i2.}

you can see that last is false, equality is no more transitive...
This is cured if increment equality test is removed. Could not find a failure example in this case...

Improvement attached.
(0008633)
black   
12-11-06 11:02   
You have a legitimate concern that a bad hash will make dictionaries slow. Unfortunately, a _good_ hash will also make dictionaries slow, if it is expensive to compute.

The right trade-off is not easy to find. I think that the important case is with symbols. However, remember that the very reason that Symbols exist is so that they can use an efficient equality operation, and one would think, an efficeint hash.

However, at some time in the past, someone made strings and symbols containing the same characters equal. This means that their hashes must also be equal. Which means that Symbol hash is slow.

We may as well get rid of symbols ;-)
(0008837)
nicolas cellier   
01-06-07 02:32   
Very true, symbol hash cannot rely anymore on object ID...

But before throwing them, remember that they have another property:
if they were duplicated they would waste a lot of memory in your image !
(each message send in each method uses one, except maybe special selectors...)