Re: A multi-byte character extension proposal
I got a mail from a Nippon Univac person on this matter.
He is one of the contributors to the Kanji WG.
Here is his comment.
(I am just acting as a junction between the Japanese net and the US.)
If someone is interested in replying to him, I will forward
the mails to him (since direct mailing is not permitted).
Masayuki Ida
============================ his comment =================================
Date: Fri, 26 Jun 87 16:46:31 JST
From: tshimizu@xxx.yyy.junet (Toshihiko Shimizu)
To: ida@u-tokyo.junet
Subject: Our comments to Thom Linden's q
We have reviewed the JEIDA proposal and Mr. Thom Linden's
questions. We have been working on an implementation of the extension
of Common Lisp for Japanese character processing on the Explorer Lisp
machine. Our efforts have proceeded in parallel with the JEIDA
Kanji WG activity, and our specifications had been nearly fixed before
the final proposal was issued. Even so, our implementation conforms to
the proposal almost entirely, except on the points left implementation
dependent. We think it is worthwhile to answer Linden's questions in
terms of our implementation, for your information.
First we have to give an overview of our implementation. Our primary
goal is exactly the same as the JEIDA proposal's: we want to use
Japanese characters "as almost the same as" the characters already
available. On the Explorer, an extended ASCII character set called the
Lisp Machine character set has been in use. We have added a new
character set for Japanese characters, called the JIS character set,
which is defined as the standard in Japan. This set uses double-byte
character codes. The Explorer has the capability to handle strings
whose elements are two bytes long. This string type can be considered
a subtype of type string, so we use this type of string to hold
double-byte characters. Naturally these strings can also hold
single-byte characters mixed in. This implementation is almost the
same as the scheme using the "internal-thin-string" type described in
the proposal. We are now preparing a paper on this implementation for
WGSYM IPSJ, September 1987. Please refer to it for further details.
The following are our answers to Linden's questions:
1)
All Common Lisp standard characters are included in the standard JIS
character set, but they have character codes different from the ones
in the ASCII character set. The same situation typically arises with
file systems that allow the JIS character set. We therefore think this
difference has to be preserved when files are read into Lisp as a
sequence of characters. After that we can think about parsing,
discussed later.
2)
The above interpretation seems to lead to a contradiction with the
description in CLtL (p. 233). We think that two distinct character
objects may have the same printed glyphs, but in that case they should
have the same syntactic properties. They are indeed different
characters, but sometimes we have doubts, because they may be printed
in various fonts, and sometimes the printed figures are very similar.
3), 4)
Actually we have both single-byte and double-byte representations for
some characters. But we never try to map them into one, except when
the Lisp reader parses them, because this difference has to be
preserved as described above. Moreover, once the two representations
have been mapped into one, there is no reasonable way to make the
inverse mapping. This is the crucial point for applications on Lisp
that interact with other, conventional applications. Suppose we have a
text processing application on Lisp and we want to use it on a text
file in which single-byte and double-byte characters are mixed. It is
not desirable if all single-byte characters in the source text file
come out mapped into double-byte ones.
5)
Our standpoint now is that a double-byte character can be a standard
character within the parsing context only if its printed glyph is
regarded as that of a standard character. As a result, there must be
some test for this correspondence. Actually we have such an
"equivalence test". Both the single-byte character set and the
double-byte character set include standard characters. If a character
from the single-byte character set is a standard character, there is a
corresponding character in the double-byte character set. These two
characters pass the "equivalence test", but they are never EQ.
However, this point may lead to a contradiction with the description
in CLtL (p. 20).
5a)
Accordingly, our implementation recognizes some double-byte characters
as standard characters. For example, STANDARD-CHAR-P returns T for
#\a in the double-byte character set.
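As a sketch of this behavior, suppose the JIS full-width letter ａ is
the double-byte counterpart of #\a (the name CHAR-EQUIVALENT-P is our
invention here for illustration; in our implementation the test is
internal to the system):

```lisp
;; Sketch of the "equivalence test" described above; behavior shown
;; in the comments is that of our implementation, not standard CL.
;; #\ａ denotes the double-byte (JIS full-width) version of #\a.
(eq #\a #\ａ)                 ; => NIL, the two characters stay distinct
(char-equivalent-p #\a #\ａ)  ; => T  (hypothetical predicate name)
(standard-char-p #\ａ)        ; => T, since its printed glyph is that
                              ;    of a standard character
```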
5b)
Our implementation takes option 3 in the proposal. That is, we don't
distinguish single-byte and double-byte versions of symbols, but we
preserve this difference within strings. For example, the two versions
of the symbol 'LAMBDA are considered EQ, but two versions of the
string "LAMBDA" are distinguished, i.e. not EQUAL, although they pass
the test described above. Further, there may be mixed versions of the
string "LAMBDA".
5c)
We might agree with Linden's point if we did not have to think about
strings. Actually our first understanding was that there was no need
to distinguish such a difference for the purposes of Common Lisp
alone. But there is a definite requirement for interaction with
conventional applications in which the distinction between the
single-byte and double-byte versions is significant. So we decided
that the distinction is not necessary for symbols, which play an
important role in programs, whereas it is necessary for strings, which
are primarily used for interaction with the outside world, such as
files, displays, and networks.
5d)
Since we defined that a double-byte character may be a standard
character, it is consistent to define such a character to satisfy
ALPHA-CHAR-P. Then both versions of the character 'a' satisfy
ALPHA-CHAR-P, ALPHANUMERICP, and LOWER-CASE-P.
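Concretely, with #\ａ again standing for the double-byte version of
the letter a, the predicates behave as follows in our implementation
(this is implementation dependent, not guaranteed by CLtL):

```lisp
;; Both versions of the letter satisfy the standard predicates.
(alpha-char-p #\ａ)    ; => T
(alphanumericp #\ａ)   ; => T
(lower-case-p #\ａ)    ; => T
```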
5e)
We think these descriptions should be elaborated, but the JEIDA
committee has decided that they should be left implementation
dependent.
6)
In our implementation, the syntactic attributes relevant to parsing
and format control are defined only for standard characters. That is,
if a character is a double-byte character and also a standard
character at the same time, it may have non-constituent syntax;
indeed, it has the same syntax attribute as its single-byte version.
For example, the string "123" in its double-byte version is also
parsed into the number 123. Otherwise its syntax cannot be anything
other than constituent.
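The numeric example above can be sketched with the reader directly,
writing １２３ for the double-byte digit string (again, this shows our
implementation's behavior, since standard CL does not define it):

```lisp
;; The double-byte digit string parses to the same number as its
;; single-byte version, because the reader applies the equivalence
;; mapping before token parsing.
(read-from-string "１２３")       ; => 123 in our implementation
(= (read-from-string "１２３")
   (read-from-string "123"))      ; => T
```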
7)
We think it is not necessary to have a large readtable that covers all
characters of type string-char. We only have a readtable for
single-byte characters and use the "equivalence" mapping for the
double-byte versions of these characters. The rest of the double-byte
characters are defined to have constituent syntax.
8)
In our implementation, calling MAKE-DISPATCH-MACRO-CHARACTER on a
non-standard double-byte character is an error.
------------- end of the message -----------