
long-char, kanji



I like a lot of what Moon described, with a few reservations.  Let me
first describe what I see as the requirements:

In trying to formulate an international standard for Common Lisp, we
clearly need to deal with the issue of extended character sets.  I'm
assuming that 16 bits of character code is enough to meet everyone's
needs -- is that naive?  How many thousand characters are necessary for
Japanese, and are these the same as the thousands needed for Chinese?
Are there other languages in the computer-using world that have
non-phonetic alphabets with thousands of characters?

I think that we should define some notion of fat characters and the
strings to hold them, and make sure that these are considered in all the
appropriate places as we write the rest of the spec.  Fat characters and
strings should be an optional language feature: an implementation does
not have to support these, but if it does, it should do it in the
standard way.  (We can specify some marker that is put on the *features*
list if and only if fat characters are supported.)  I assume that any
Lisp that does not support fat characters will not do well in the
Japanese market, so there's plenty of incentive for big companies to
support this feature.

The specification of fat characters must be done in such a way that
currently legal implementations that do not support them can be left as
is; implementations that do support them must be able to do so without
penalizing users of normal non-fat strings, either in speed or storage
space.

The Symbolics spec, as described by Moon, meets these goals.  However, he
says that Fat-Char and String-Char form an EXHAUSTIVE partition of
Character.  This means that if an implementation supports any Char-Bit
or Char-Font bits, the fat strings must be able to accommodate these, in
addition to the longer Char-Code field.  Since the Char-Code will
typically be 16 bits, it would be nice to be able to store just the
char-code in a fat string, and not make the big jump to 32 bits per
character, which is the next stop for most stock-hardware machines.  I
don't know how important people feel this is, but if I were storing lots
of Japanese text in some application, I think I'd object to a 2X bloat
factor.
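
A minimal sketch in C of the storage arithmetic behind that 2X figure.
The field widths (16-bit code, 4 bits each for Char-Bit and Char-Font)
and all names here are assumptions made for illustration, not part of
any proposal; the only point is that a code wider than 16 bits rounds
up to a 32-bit cell on stock hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layouts; the field widths are illustrative assumptions. */
typedef uint16_t code_only_char;  /* 16-bit Char-Code, no bits or font   */
typedef uint32_t full_char;       /* code + bits + font, padded to 32    */

enum { CODE_WIDTH = 16, BITS_WIDTH = 4, FONT_WIDTH = 4 };

/* Pack code, bits and font into one 32-bit cell (8 bits left unused). */
static full_char pack_full_char(uint16_t code, unsigned bits, unsigned font)
{
    return (full_char)code
         | ((full_char)(bits & 0xFu) << CODE_WIDTH)
         | ((full_char)(font & 0xFu) << (CODE_WIDTH + BITS_WIDTH));
}

/* Bytes needed to store n characters in each representation. */
static size_t code_only_bytes(size_t n) { return n * sizeof(code_only_char); }
static size_t full_bytes(size_t n)      { return n * sizeof(full_char); }
```

Under these assumptions, a 1000-character text takes 2000 bytes as
code-only cells and 4000 bytes as full cells.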

Two solutions are possible:

First, we could alter the type hierarchy as Moon suggests, and begin to
encourage implementations to exercise their right to have zero-length
font and bit fields in characters.  A lot of us have come to feel that
these were a major mistake and should begin to disappear.  (We wouldn't
legislate these fields away, just not implement them and encourage code
developers not to use them.)  An implementation that does this can have
Fat-Strings with 16 bits per char, all of it Char-Code.

Alternatively, we could say that Fat-Char is a subtype of Character, with
Char-Bit and Char-Font of zero.  String-Char is a subtype of Fat-Char,
with a Char-Code that fits (or can be mapped) into eight bits.  A
Thin-String holds only characters that are of type String-Char.  A
Fat-String holds Fat-Chars (some of which may also be String-Chars).  If
you want a vector of characters that have non-zero bits and fonts, then
you use (Vector Character).  I'm not sure what we do with the String
type-specifier; the two reasonable possibilities are to equate it to
Thin-String or to the union of Thin and Fat Strings.
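
The nesting in this second alternative can be sketched as a pair of
predicates.  This is only an illustration in C: the struct, its field
widths, and the function names are hypothetical, and it takes the
simple reading in which a String-Char's code fits directly in eight
bits (ignoring the "can be mapped" case).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical character record; field names and widths are assumptions. */
struct character {
    uint16_t code;
    uint8_t  bits;
    uint8_t  font;
};

/* Fat-Char: a Character whose Char-Bit and Char-Font are both zero. */
static bool is_fat_char(struct character c)
{
    return c.bits == 0 && c.font == 0;
}

/* String-Char: a Fat-Char whose Char-Code fits in eight bits. */
static bool is_string_char(struct character c)
{
    return is_fat_char(c) && c.code <= 0xFF;
}

enum container { THIN_STRING, FAT_STRING, VECTOR_CHARACTER };

/* Narrowest container able to hold all n characters. */
static enum container narrowest_container(const struct character *chars,
                                          size_t n)
{
    enum container need = THIN_STRING;
    size_t i;
    for (i = 0; i < n; i++) {
        if (!is_fat_char(chars[i]))
            return VECTOR_CHARACTER;   /* non-zero bits or font */
        if (!is_string_char(chars[i]))
            need = FAT_STRING;         /* code wider than eight bits */
    }
    return need;
}
```

Every String-Char passes is_fat_char, so a Fat-String can hold anything
a Thin-String can, while a character with bits or font set fits only in
the general vector.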

    A SIMPLE-STRING is any string, thin or fat, that is a SIMPLE-ARRAY.
    Since I don't know what the SIMPLE-STRING type is for, I don't know
    whether allowing SIMPLE-STRINGs to be fat is good or bad.

For those of us without microcoded type-dispatch, Simple-String is a
very important concept.  It says that you can access the Nth character
in the string simply by indexing off the address of the string by N
bytes (maybe adding some fixed offset).  On a lot of machines that is
one or two instructions, and no conditionals.  If Simple-Strings can be
either fat or thin, then you have to make a runtime decision about
whether to index by N bytes or 2N bytes.  So it is best to reserve
Simple-String for simple thin strings and maybe add another type for
Simple-Fat-String.
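
To make the cost concrete, here is a rough sketch in C, assuming 8-bit
thin and 16-bit fat elements.  The type and function names are
hypothetical, and header offsets and bounds checks are left out.  With
separate types, each access is a single indexed load; with one type
covering both widths, every access has to test a tag first.

```c
#include <stddef.h>
#include <stdint.h>

/* Width known statically: a single indexed load, no conditionals. */
static uint8_t simple_thin_ref(const uint8_t *data, size_t n)
{
    return data[n];               /* address = data + n bytes   */
}

static uint16_t simple_fat_ref(const uint16_t *data, size_t n)
{
    return data[n];               /* address = data + 2n bytes  */
}

/* Hypothetical string that may be thin or fat, decided at run time. */
struct any_string {
    int         is_fat;           /* run-time width tag */
    const void *data;
};

/* Width unknown until run time: a test on every access. */
static uint16_t any_string_ref(const struct any_string *s, size_t n)
{
    if (s->is_fat)
        return ((const uint16_t *)s->data)[n];
    return ((const uint8_t *)s->data)[n];
}
```

The branch in any_string_ref is the runtime decision described above;
the two simple accessors avoid it because the element width is fixed by
the type.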

-- Scott