[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

EQUAL, and hash tables, and value/object distinctions

Two remarks on EQUAL and hash tables, from a hash table fan.

First, if EQUAL is going to descend structures and arrays to test
for isomorphism, it should compare hash tables for equality
according to their contents.  That is, it should verify
that two possibly-equal hash tables have the same set of
keys, and that for each key the two stored values are EQUAL.
Key equality testing must be done using each table's
key comparison function.

[Rationale & discussion are below, on the next page.
But first I've got to respond to something else.]

   Date: Wed, 29 Jun 88 18:57:09 PDT
   From: Jon L White <edsel!jonl@labrea.stanford.edu>

       From: goldman@vaxa.isi.edu
       Date: Fri, 24 Jun 88 18:15:43 PDT

       If programmers to extend EQUAL in CLOS style -- which means
       with arbitrary procedural definitions for their new classes -- could 
       EQUAL hash tables be implemented efficiently?  How would they hash 
       values belonging to the new classes?

   As you pointed out in subsequent discussion, a user-supplied EQUAL method,
   whether for CLOS classes or as a defstruct option, leaves open the question
   of mechanically verifying that the alleged predicate is reflexive, symmetric,
   and transitive.  Another major problem with using hash-tables on objects
   with non-default EQUAL methods is that the SXHASH function must be similarly
   extended for these data types; SXHASH must obey the property that
       (equal x y)   imples   (= (sxhash x) (sxhash x))
   See CLtL p285.  At one time, there was a suggestion that hash tables admit 
   a user-supplied equivalence predicate; but it never got very far for just 
   this reason, that the equivalence predicate must be *** paired up** with an
   appropriate "numericalizer" function, and many implementors felt that users 
   wouldn't understand these issues and wouldn't be able to construct up their 
   own versions of sxhash to go with their alleged equivalence predicates. 

Yuck.  Who are these brilliant Implementors, that we may all worship
at their feet?  For they are the curators of such profound knowledge as
       (EQUALITY-PREDICATE X Y)   ==>   (= (HASH-FN X) (HASH-FN Y))

If we mere Users cannot be trusted with such wisdom, how are we to
navigate the Streams of Extensibility, or handle the Objects of
Genericity?  Oh great Implementors, save us, we pray, from overmuch

Seriously, I hear an alarming amount of condescension in the above reasoning.

The above logical implication is ALL THAT IS NEEDED to construct an
appropriate hash function.  It is the sole and fundamental insight which
makes hash tables work; I have relied on it for years, building many a
hash table in various extensible frameworks.  What's the problem of
requiring a hash-table builder from promising that his hash and equality
functions bear the simple relation to each other?  And if he can't
handle that, what chance does he have of building a working stream type,
or making sense out of CLOS?  (Note that many vendors have supplied more
or less hairy stream definition facilities, which usually require
invariants much more complex than the hash invariant above.)

As for "mechanical verification" of the hashing and equality invariants,
that's a red herring.  Since when do we mechanically verify any of the
required properties of functional arguments to Lisp primitives?  (I'm
thinking particularly of the :KEY and :TEST arguments to sequence
functions.)  Just say that unless the supplied functions satisfy the
required conditions "it is an error".

The "pairing up" of equality and hashing could be done cleanly in CLOS
by defining generic equality and hashing methods for particular hash
table types.

   [SXHASH is required to be "good enough" for the EQUAL equivalence releation,
   and for subsets of that relation such as STRING=, but it is unlikely that 
   any vendor makes it "good enough" for, say, EQUALP.  Lucid has EQUALP hash
   tables, but they don't use SXHASH.]

   -- JonL --


Proposal:  EQUAL should check for isomorphism of hash tables.

The effect of this rule is to treat hash tables like lists,
structures, and arrays for equality testing, under the following
reasonable theory of the semantics of lists, structures, arrays,
and hash tables:  The meaning of any such data structure is
determined by a small set of (call them) "access keys" and
a value stored under each access key.  The set of access
keys for each kind of data structure is {CAR,CDR}, a set of
slot access functions, an initial setquence of non-negative integers,
and a set of Lisp objects, respectively.

Other than the differences in these "access keys", all four
of the above data structures are tuple types, in that they
store a finite set of values, each accessible under its own
key.  An equality function which performs an isomorphism check
on any of those tuple types should perform it on all of them,
or have a very specific reason not to.

If two hash tables differ not in their contents but in their
implementation parameters (such as size or equality function),
we must choose whether to compare structure abstractly
(looking only at contents) or concretely (looking also
at implementation parameters).  This question comes up
with respect to array structural equality also.  For
example, do we allow a fill pointer to define an array's
length (i.e., its "access key" set), or do we look at
the length of the underlying allocated storage?  What
about comparing an (array t) having a fill pointer,
with a string?  I believe the people working on this
problem will come to a more or less abstract viewpoint,
and if so, that viewpoint should be applied consistently
to hash tables.

An interesting question arises when two hash tables have
different key equality predicates, and (suppose) we've already
decided to treat equality predicates as implementation details,
and ignore them directly.  Here's one answer:  Declare two
hash tables EQUAL if they include each other.  That is, for
every key K in hash table A, require that
and the same for every key in table B.

Possible specific reasons not to test for hash table isomorphism,
with comments or rebuttals in brackets:
  (1) It's too hard to implement.
      [No, it's actually pretty easy to implement this test
       portably using a MAPHASH over each table; each MAPHASH
       tests one direction of inclusion by performing
  (2) It's too slow to implement.
      [Any recursive descent takes a long time to do.
       But hash tables can probably be made faster to compare
       than corresponding CONS structures, because they can use
       saved key hash values to advantages.  A hash table's
       key set can be hashed by combining all its key hash
  (3) Recursive descent into hash tables is too unexpected.
      [For programmers unfamiliar with hash tables, many of
       their properties are going to be pretty surprising.
       Programmers who understand their nature and use are
       also going to understand their correspondence with
       lists, structures, and (especially) arrays; it is
       these programmers whose expectations we should design for.]
  (4) Hash tables are different because we expect frequent
      side effects on them, and so momentary isomorphism
      between two of them is not significant.
      [This is the best argument against, I think.  See below.]

About (4):  Cons cells, structures, arrays, and hash tables can all be
used either as pure values (that is, no updates after initial
construction, and object identity is not important) or as updatable
objects (updates are expected, and shared object identities are
crucial).  Under current Lisp practice, hash tables are rarely used
as pure values.  (Cons cells usually are, and structures and arrays
occasionally are.)

We may call types which are used as pure values "value-like"
and types which are used as shared objects "object-like".

Notice that EQUAL and EQL differ in their theory of cons cells: EQUAL
treats them as pure values, and EQL as sharable objects.  I believe
a consistent way of accounting for this is to say that the behavior
of EQUAL and EQL differ for types which have this dual nature,
of value-like and object-like.  (For types which are only value-like,
such as complex numbers, EQL compares substructures, just as EQUAL
does.  For types which are only object-like, such as streams,
EQUAL compares object identity, just as EQL does.)

So, if you believe all that about EQUAL and EQL, the question of EQUAL
on hash tables gets down to this:  Are hash tables so object-like that
no one is likely to think of them as pure values?

Consider this:  Hash tables can be used to implement sets,
if you need to optimize the speed of the set-inclusion operator.
Sets are value-like types.  Hash tables can be used to implement
finite functions; if they themselves are hashable and comparable,
one can build higher-order functional systems out of them
(and get memoization as a bonus).

				-- John