A multi-byte character extension proposal
We have had the opportunity to review the
English translation of part of the JEIDA proposal
regarding the extension of Common Lisp to large
(multi-byte) character sets. We were impressed with the scope of the
issues addressed. However, the translation was vague on a few points,
and we would like clarification. Your answers to the following
questions would be appreciated:
The paper proposes that Lisp characters be assigned unique codes over
a potentially large domain (possibly all the integers between 0 and
65535). Are characters with different codes always syntactically
distinct? Can the standard character #\( have two different codes,
corresponding, for example, to two different external file system
representations of that character? If so, would the Lisp reader
always parse these two representations identically?
CLtL says (p. 233), "The code attribute is intended to distinguish
among the printed glyphs and formatting functions for characters."
Does the JEIDA proposal permit two different string-chars to have
the same print glyph, '(' for example, but different syntactical
Under the proposal, it is up to each implementation how it
chooses to translate the contents of files or terminal input when
passed as input to the Lisp system. We have both single-byte and
double-byte character conventions for file contents. Is it therefore
allowable to map both of these sets of codes onto a single
internal Lisp character code set when inputting data to Lisp, and
to adopt our own conventions for translating output back to
single-byte and double-byte form?
An elaboration of the previous question: Is it possible for an
implementation to represent all of the standard characters internally
with 2-byte codes, and to map some 2-byte character codes and some
1-byte character codes in system files onto the same set of 2-byte
internal codes for the standard characters when read into Lisp?
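To make the mapping we have in mind concrete, here is a minimal sketch. It is written in Python rather than Lisp, because the Lisp behavior is exactly what we are asking about. It rests on assumptions of ours, not on anything in the proposal: Unicode fullwidth forms (U+FF01 through U+FF5E) stand in for the double-byte versions of the standard characters, and the names to_internal and read_external are hypothetical.

```python
# Assumption: Unicode fullwidth forms stand in for the double-byte
# representations of the standard characters. Each fullwidth form
# U+FF01..U+FF5E sits exactly 0xFEE0 above its ASCII counterpart.
FULLWIDTH_FIRST = 0xFF01  # FULLWIDTH EXCLAMATION MARK
FULLWIDTH_LAST = 0xFF5E   # FULLWIDTH TILDE
OFFSET = 0xFEE0


def to_internal(ch: str) -> str:
    """Map one external character to its (hypothetical) internal code."""
    code = ord(ch)
    if FULLWIDTH_FIRST <= code <= FULLWIDTH_LAST:
        # Double-byte form of a standard character: fold it onto the
        # same internal code as the single-byte form.
        return chr(code - OFFSET)
    # Everything else keeps its own code.
    return ch


def read_external(text: str) -> str:
    """Translate file or terminal input into the internal code set."""
    return "".join(to_internal(c) for c in text)
```

Under this treatment, a file containing a double-byte open parenthesis and a file containing a single-byte one would present the same character to the reader. Output translation would be a separate, implementation-chosen convention.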
The English copy we saw of the proposal did not contain section 4.4.
Based on our own translation from the original in Japanese, this
section seems to discuss implementation issues. Given a system such
as ours, which has both single-byte and double-byte conventions for
some characters, including the Common Lisp standard characters, there
seem to be two possible treatments of double byte characters. The
first is the case where a double-byte character can be a standard
character. The second is where a double-byte character cannot be
a standard character.
Is the proposal's default, in section 4.4, that a Lisp
implementation reading a file on a system with both single-byte
and double-byte conventions does not recognize any double-byte character
in that file as a standard character?
There are three example Lisp forms with underlining which seem
to indicate three options of single-byte/double-byte equivalency.
Consider the symbol 'abc' in these forms. Is the difference
between option 1 and option 2 whether the Lisp system would
recognize a single-byte version and a double-byte version
of this symbol-name in the same file as referring to the same
1. (list abc /fg " xy " )
2. (list abc /fg " xy " )
-- ---- ----- --- ----
3. (list abc /fg " xy " )
If a Lisp system does not recognize these two symbol names
as equivalent, doesn't that require the Lisp system to have
two characters with different codes but the same print glyph?
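The distinction we are asking about can be modelled as follows. This is only a sketch in Python under our own assumptions: the symbol table is a plain dictionary, intern_symbol is a hypothetical helper, fullwidth Unicode letters stand in for the double-byte spelling of 'abc', and NFKC normalization stands in for whatever folding rule an implementation might adopt.

```python
import unicodedata


def intern_symbol(table: dict, name: str, fold: bool) -> int:
    """Return the id of the symbol called name, creating it if needed.

    When fold is true, double-byte (fullwidth) spellings are folded
    onto their single-byte equivalents before lookup.
    """
    key = unicodedata.normalize("NFKC", name) if fold else name
    return table.setdefault(key, len(table))


single = "abc"                 # single-byte spelling
double = "\uff41\uff42\uff43"  # fullwidth a, b, c (double-byte stand-in)

# Treatment A: both spellings name the same symbol.
folded = {}
same = intern_symbol(folded, single, True) == intern_symbol(folded, double, True)

# Treatment B: the spellings name two symbols that print alike.
unfolded = {}
distinct = intern_symbol(unfolded, single, False) != intern_symbol(unfolded, double, False)
```

In treatment B the table ends up holding two entries whose names look identical when printed, which is the situation the question above is concerned with.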
Under the default proposal, if the character object print
syntax "#\a" or "#\A" is read from a file, is alpha-char-p true
1. if the 'a' had been encoded as a single byte?
2. if the 'a' had been encoded as a double byte?
3. if the 'A' had been encoded as a single byte?
4. if the 'A' had been encoded as a double byte?
Is section 4.4 a part of the proposal to ANSI, or
has it been included merely for illustrative purposes?
If you could elaborate (in English) on the content of section
4.4, we would greatly appreciate it.
Even if the Lisp system supports a large character set, only
standard characters have, as a default, non-constituent syntax
type, constituent character attributes relevant to parsing of numbers
and symbols, or defined syntax within a format control string.
If a Lisp system supports a large character code set, need it allow
every character of type string-char to have a non-constituent syntax
type defined in the readtable, or is the proposal's default that
only standard characters need be represented in the readtable?
A specific case related to the previous question: suppose #\% were a
non-standard character, but still a string-char in some implementation
of Lisp. Is
necessarily permitted in every implementation that supports #\% as a