Mantis - Squeak
Viewing Issue Advanced Details
5703 Collections minor always 01-05-07 10:39 01-06-07 05:43
wiz  
 
normal  
new 3.9  
open  
none    
none  
0005703: The hash for class Interval seems pretty weak.
Read the code:
"Interval>>hash
    "Hash is reimplemented because = is implemented."

    ^(((start hash bitShift: 2)
        bitOr: stop hash)
        bitShift: 1)
        bitOr: self size"

That seemed like it gave a mighty good chance for hashes to clash.
Especially the bitOr instead of bitXor.
So I tried to satisfy my curiosity by seeing how many matched more than one interval.
Copy the uploaded file into a workspace and see for yourself.

I was surprised with how small a sample needed to be taken in order to show the weakness.

Less than 10% of the intervals had unique hashes. And I was seeing a lot of multiple clashes.


I am not an expert on hashes. So this is here as a heads up and an alert.

If you know enough to evaluate the importance of these results say so.
(in other words if I'm all wet and the strength of this hash doesn't make a difference, I'd appreciate being clued in.)

Yours in curiosity and service, --Jerome Peace
child of 0007002new  The Mother of all issues involving interval 
 Interval-hash-probe.text [^] (1,472 bytes) 01-05-07 10:39

Notes
(0008838)
nicolas cellier   
01-06-07 02:33   
It is weak and non compliant with = (when compared to an array)
see 0003380
also 0003488

(0008840)
wiz   
01-06-07 03:49   
Thanks Nicolas,

The information was informative.

Its not clear from the reports if things were resolved or not. I suspect not.

I think even the simple hash can be improved if the number of elements becomes a strong focus for the hash.
Things with a different number of elements will not be equivalent.
Then check the first and last elements and then what ever else is necessary.
(0008843)
nicolas cellier   
01-06-07 05:43   
Yes proposed patches in http://bugs.impara.de/view.php?id=3488 [^] solve the problems.

1) first last and size are used in Interval comparison
(not increment, in order to avoid floating point bug)

2) Interval uses super hash in order to have same hash code when equal to an array.

3) Super hash is redefined for efficiency in order to not hash each and every element.
(Time millisecondsToRun: [(1 to 1000000) hash])

There is just a discussion about how many elements to be used in super hash.
5 first ? more ? (at beginning and end)
There is no universal efficient hash code... You can always build a counter example. One has to decide for a magic number compromise.

Concerning interval, of course using 3 elements or two elements plus size would be enough to obtain a correct hash distribution... Just replace bitOr: with bitXOr: in current implementation... Unfortunately we must use super hash because of equality with array...