
The first note on kanji, sent to the junet site in January 1986, and some reactions in Japan



[Please note:  the character system I refer to as our
"current" implementation is new, and will be released
to customers this fall as part of Release 7.0 of our
software.]

    Date: Sat, 10 May 86 12:30:08+0900
    From: Masayuki Ida <tansei!a37078%utokyo-relay.csnet@CSNET-RELAY.ARPA>
    I am preparing a document on a Kanji standard in Japan for Common Lisp.
    The concepts in it are quite natural to Common Lisp, I think.
    But I want to have opinions and advice from as many people as possible.
    The key ideas are as follows:
     1) Include a japanese-char data type and a normalized-string data type.
    The japanese-char type is a class for JIS 6226, and is a subtype of string-char.
    (Physically, each JIS 6226 character occupies 2 bytes.)

Make it a subtype of EXTENDED-STRING-CHAR instead.  Or just
CHARACTER.  Because the character codes are two bytes long,
many systems may wish to make their normal strings only hold
"ordinary" characters (i.e.  what are currently called
string-char).  What you are proposing here is an incompatible
change to STRING-CHAR, and I don't believe it is really
necessary.

In our implementation, Japanese characters are SUBTYPEP of CHARACTER.
STRING-CHARs are characters that can be stored in ANY string.
We have special strings that can hold ANY character (the type is,
appropriately enough, STRING).
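
In CL terms, that layout looks roughly like this (a minimal sketch,
assuming an implementation with both thin and general strings):

;; Every Japanese character is a CHARACTER, though not necessarily a
;; STRING-CHAR, so it may not fit in an ordinary thin string:
(subtypep 'string-char 'character)                        ; => T
;; A general string, able to hold ANY character:
(make-array 10 :element-type 'character :initial-element #\Space)
;; An ordinary string, restricted to STRING-CHARs:
(make-string 10 :initial-element #\Space)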

    The normalized-string type is a subtype of the string type, and its component
    characters are of the japanese-char type.

This is an extremely poor choice of names for this type.  Besides,
you're getting into specifying the implementation.

    A NOT-normalized string may contain japanese-char and other characters intermixed.

What about implementations which have no wish to make their usual
strings huge to support Japanese?  Since Japanese characters take
two bytes, and a string must still distinguish them from ordinary
characters, we're talking a minimum of three bytes per character.
Implementations with a lot of documentation strings will find the
cost of that documentation suddenly tripled.  (Actually, you can
do it with less, because Kanji does not actually need the full 2^16
codes.  The rest of the languages (except things like Chinese) are
small enough to fit in at the end, or the start, or wherever the
implementation chooses.  Remember, CL never specifies the character
codes for individual characters, just their meaning.)
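
For scale (my arithmetic, not part of the proposal): JIS 6226 assigns
its characters within a 94-by-94 code table, so

(* 94 94)    ; => 8836 code points in the table
(expt 2 16)  ; => 65536, leaving most of a two-byte code space free
             ;    for the other character sets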

     2) Include string-normalize, normalized-string-p, and japanese-char-p at least.

I take it string-normalize takes a string and generates a more compact
representation?  There's nothing wrong with that, and perhaps "normalize"
is the right word here.  However, specifying Japanese as the single and
sole type that is "normalized" to is rather ungeneral.  I don't think
ANY string type should be named NORMALIZED, nor do I think there
should be a NORMALIZED-STRING-P predicate that determines whether
a string is of some specific type.  Instead, NORMALIZED-STRING-P should
say whether STRING-NORMALIZE will, in fact, do anything, or if it will
just return its argument.
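
One way to pin that down (a sketch, assuming STRING-NORMALIZE returns
its argument EQ when there is nothing for it to do):

(defun normalized-string-p (string)
  ;; True exactly when STRING-NORMALIZE would just return its
  ;; argument rather than build a more compact representation.
  (eq (string-normalize string) string))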

About JAPANESE-CHAR-P:  I am glad to see you have not tried to
do this with the CHAR-FONT field, but rather included it into
the CHAR-CODE.  This is indeed the right way to support different
character sets, as opposed to character styles (i.e. bold, italic,
etc).  CLtL did not make it clear what CHAR-FONT is intended
for, but if you use CHAR-FONT, then you get characters which appear
to be the "same" character, even though they are in different languages.
We eventually decided that CHAR-FONT was so ill-specified as to be
useless, and do not use it for anything, and CHAR-FONT-LIMIT is 1
in our system.
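
Concretely, that design makes JAPANESE-CHAR-P a question about
CHAR-CODE alone.  A sketch; the boundary code is entirely
hypothetical, since CL never assigns codes to particular characters:

(defconstant japanese-code-start 8192)  ; assumed, implementation-chosen
(defun japanese-char-p (char)
  ;; Character-set membership lives in the CHAR-CODE; CHAR-FONT is
  ;; not involved at all.
  (>= (char-code char) japanese-code-start))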

     3) char-code-limit should be greater than 16 bits to hold japanese-char characters.
     char-bits-limit is not always meaningful for japanese-char.

In our implementation, char-code-limit is 2^16, and it works fine.
Of course, we have provision for extending things beyond char-code-limit
as needed, but we have never needed that.
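
In code, the arithmetic of that choice (a sketch of what such an
implementation guarantees):

;; Room for every two-byte code:
(assert (>= char-code-limit (expt 2 16)))
;; so any japanese-char fits in a single CHAR-CODE:
;; (char-code some-kanji) => an integer below CHAR-CODE-LIMIT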

One other thing:  Please do not specify any standard for the values
of CHAR-CODE belonging to specific characters.  Doing so only invites
conflicts between standards for different languages.  For example,
Hebrew and Arabic might easily choose the same range of values, because
the standards groups weren't talking to each other, or didn't even
know about each other.  Or a standards group might choose a standard
which would make life very hard for an implementation.

Instead, specify that if these are to be written to a file, that it
be done with a stream gotten by
(OPEN PATHNAME :ELEMENT-TYPE 'EXTENDED-STRING-CHAR :DIRECTION :OUTPUT)
or
(OPEN PATHNAME :ELEMENT-TYPE 'CHARACTER :DIRECTION :OUTPUT)

Then standardize on interchange and communication formats for files.
For example, if there is an ISO standard for storing characters of
different character-sets, you might write

(OPEN PATHNAME :ELEMENT-TYPE 'EXTENDED-STRING-CHAR
	       :DIRECTION :OUTPUT
	       :FILE-DATA-FORMAT :ISO)

This leaves implementations with their own file format free to
work with that format by default.  For example, our file format
allows us to do a number of things which will not be part of any
character standard, such as include diagrams, marks, etc.  We
would not want to abandon our file format wholesale.

     4) A length function invocation on japanese-char strings must return
    the apparent length, not the storage length.

Indeed!

CL has no concept of storage length.  The length function must return
the number of characters, which is the number of objects that can
be accessed with ELT.  I'm trying to reinforce what you're saying
here slightly.  An implementation that simply stores Japanese as
pairs of characters is incorrect, even if LENGTH returns the right
number, if ELT, AREF, etc. don't give back the entire Japanese
character.
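
Using the JAPANESE-CHAR-P sketch from above, the requirement can be
written as a check (hypothetical, but it captures both halves of the
point):

(defun kanji-string-ok-p (s n)
  ;; S is known to contain exactly N Japanese characters.  Both tests
  ;; fail in an implementation that stores them as pairs of bytes:
  ;; LENGTH would return 2N, and ELT would hand back half-characters.
  (and (= (length s) n)
       (every #'japanese-char-p s)))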

    I think the above ideas will cope with various implementations,
    including the usual English-text-only systems.

    If you have an idea to discuss, please let me know.
    Thank you.

    Masayuki Ida
    junet: ida@ccut.u-tokyo.junet
    Csnet/arpanet: ida%utokyo-relay.csnet@csnet-relay


    --------------------- following are the reactions to the above message --------

    1) Prof. Kigen Hasebe told me that there is a kind of standard for AT&T UNIX.
    2) Dr. Morisaki of NTT told me that MIT LCS already had experience coping with
      Japanese characters in their NIL in 1984.

I don't know about NIL.  I do know that we have been supporting
Japanese characters for at least that long, and our experiences
with it are part of the motivation for the design of our current
character system.

    People from several Common Lisp(-like) implementations told me that they have
    facilities to cope with Japanese characters along much the same lines I described above.

    At JEIDA, the working group for Japanese character handling in Common Lisp
    started last April.