[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
A multi-byte character extension proposal

To: common-lisp@SAIL.STANFORD.EDU, ida%tansei.u-tokyo.junet@RELAY.CS.NET
Subject: A multi-byte character extension proposal
From: Masayuki Ida <a37078%tansei.u-tokyo.junet%utokyo-relay.csnet@RELAY.CS.NET>
Date: Mon, 18 May 87 14:06:46+0900
Date: Mon, 18 May 87 09:44:19 JST
From: moto@XXX.XXX.junet (MOTOYOSHI)
To: ida@tansei.cc.u-tokyo.junet
Subject: Kanji Proposal


-------------------- Beginning of text --------------------
@Comment[-*- Mode: Text -*-]
@heading[Digest]

@heading[1. Hierarcy of characters and strings]

    Let the value of char-code-limit be large enough to include all
characters.

	char > string-char >= internal-string-char > standard-char

	string >= internal-thin-string > simple-internal-thin-string

	simple-string >= simple-internal-thin-string

	string = (or internal-thin-string (vector string-char))

		Type internal-thin-string and (vector string-char) are
		disjoint or identical.

	simple-string = (or simple-internal-thin-string
			    (simple-array string-char (*)))

		Type simple-internal-thin-string and (simple-array
		string-char (*)) are disjoint or identical.

	notes:	A > B means B is a subtype of A,
		A >= B means B is a subtype of A or B is equal to A.

@Heading[2. Print width]

	Only standard characters are required to have fix-pitched
print width. Use the newly introdued function 'write-width'
to know the print width of an expression.

@Heading[3. Functions]

	Functions dealing with strings should work as before,
expect ones which change the contents of internal-thin-string
to non internal-thin-char's.

	Functions producing strings should create (vector
string-char), not internal-thin-string, unless they were
explicitly specified.

	Funtions comaring strings should copare them elementwise.
Therefore it is possible that a (vector string-char) is equal
to an internal-thin-string.
@newpage
@Heading[1. A proposal for embedding multi-byte characters]

The Kanji Working Group (KWG) examined implementation of facilities for
multi-byte character to Common Lisp.  This report is the result of many
discussions of many proposals.  Of course, this report doesn't satisfy
all proposals, but it is very close.

In order to decide on a final proposal, we chose essential and
desirable characteristics of a working multi-byte character system.
Chapter 2 describes these characteristics in some detail.

Chapter 3 describes additional features to Common Lisp which will be useful
not just for multi-byte character, but also for many other kinds of character sets.
This chapter describes internal data structures.  If this proposal is
accepted in Common Lisp, it will be easy for countries to add original mechanisms.

Chapters 4 describes proposed changes to @I[Common Lisp -- The Language]
(CLtL).

@Heading[2. Additional features for embedding multi-byte characters.]

This chapter describes design principles which can be used to design
multi-byte character language extensions to Common Lisp.

There are many programming languages which can multi-byte characters.
Most of them can use multi-byte character as string character data 
but not as variables or function names. 

It is necessary for programming languages like Lisp that use symbolic data 
to be able to process not only single-byte characters but also multi-byte characters.
That is, it should be possible to use multi-byte characters in character string and 
symbols, and it must be possible to store both kinds of characters in them.

Treating multi-byte characters just like other alpha-numeric characters
means that multi-byte character must be treated as a single character object.
Many of the present implementations of Lisp treat multi-byte character as
pairs of bytes.  Alternatively, they use a different data type which
doesn't permit multi-byte character to be mixed with standard characters.
Such systems are not useful for user.

Thus, the basic design principles for embedding multi-byte character to Common Lisp are:

@Begin[Itemize]

Multi-byte character should be treated like single-byte character, that is, 
a multi-byte character is one character object.

@End[Itemize]

@Begin[Itemize]

A program which was coded without explicit attention for multi-byte character should 
handle multi-byte character data as is.

@End[Itemize]

These principles provide sufficient functionality, but we can't ignore
efficiency.  So we considered the next principle:

@Begin[Itemize]

The performance of the system in terms of CPU and memory
utilization should not be consideraly affected in programs which do not use multi-byte
characters.

@End[Itemize]

This principle is contradictory to other principles, but this can't be
ignored when we consider the users of actual systems, so we have to
compromise.  We think that following methods will satisfy both of these
requirements.

@Heading[3. Common parts which we implement.]

This chapter describes the implementation of multiple character sets in Common Lisp. 

To treat multi-byte characters like single-byte characters, the multi-byte character must be
included in the set of possible character codes.

We consider the following implementation methods.

@Begin[Itemize]

Add multi-byte characters by setting the variable char-code-limit to a large number.

@End[Itemize]

In this case, the single-byte character set and the multi-byte character 
set must be ordered into a single sequence of character codes. This means multi-byte 
character set must not overlap with the single-byte character set.  This method could 
be satisfied within most implementations with ease.

If we use this method, it is possible to use multi-byte characters with
fonts in Common Lisp, and operations that work for single-byte 
character will also work for multi-byte character without any change.

This implementation method has problems with efficiency.  
In the case that the value of character code is greater than size of 1 byte
(multi-byte characters are in this category), memory utilization is
affected.  A string containing only one single-byte character is 2 bytes long.
The same problem would also occur with symbol p-names.  If we can solve the problem 
for strings, we can solve other problems, so we will start by considering only strings.

To avoid this memory utilization problem, it is possible to optimize and 
make single-byte character strings by packing internally. In other words, 
to have two kinds of data types and not show it to user. There is only one type of 
data from the viewpoint of users, which means that every function which uses strings 
will continue to work as defined.

This can be implemented in almost everywhere without so many costs.  The only
problem occurs when a function attempts to put a multi-byte character into an optimized
and packed sigle-byte-only string.  To work according to the definition, the implementation 
must unpack the original packed string. This presents an implementation inefficiency which 
the user may find undesirable.

One solution would be to
@Begin[Itemize]

Generate errors for operations that try to use multi-byte characters into 
single-byte string and presenting two string datatypes to users.

@End[Itemize]

We propose this latter implementation.  Common lisp should have 2 string
types to treat multi-byte characters efficiently.  The first of these is
@b[1string0], which stores any character of type @B[1string-char0], i.e.,
whose @I[2bits0] and @I[2font0] are both zero.  The type of string is
@B[1internal-thin-string0] which is the optimized character string.
@B[1internal-thin-char0] is a subtype of @B[1character0] and can be inserted into string
@B[1internal-thin-string0].  The predicate which tests for this type of character is 
@B[1internal-thin-char-p0].

The type @B[1internal-thin-char0] is a subtype of @B[1string-char0], and is a
supertype of @B[1standard-char0]. 
The data type hierarchy for @B[1character0] and @B[1string0] is shown in figure 1.
@b[1Internal-thin-char0] and @b[1string-char0] may be equal as it is possible that situations
 may arise where both sets describe the same character-set.
 This is equivalent to the type of system that has only one type of character from the 
viewpoint of the user as discussed in the previous chapter.  This proposal permits both 
kinds of implementations. 
@newpage                        
@Begin[Verbatim]                        
				character
				    |
			       string-char
				    |
			    internal-thin-char
				    |
			       standard-char

@Center[Fig-1.a  Structure of character type]

				string
				  |
		-----------------------------------
		|		  |		  |
		|	    simple-string	  |
		|		  |		  |
       internal-thin-string	  |   (vector string-char)
		|		  |		  |
		-----------------------------------
		|				  |
		|				  |
   simple-internal-thin-string  (simple-array string-char (*))


@Center[Fig-1.b  Structure of string type]




@End[Verbatim]

To compare @B[1string0] characters with @B[1internal-thin-string0] characters, it is
necessary to convert both to the @B[1string-char0] format.  This means that
the same character is the same object regardless of whether it is found
in an @B[1internal-thin-string0] or a normal @B[1string0].


Next we must discuss character input. The proposal does not discuss what is stored 
in files, nor what happens between the Lispimplementation and a terminal.  
Each system will implement this in itsown way.  Instead, let us discuss the data 
as passed to lisp programs. We think that treating all input data as @B[1string0] 
is the safest possible course. Since a symbol's p-name string should not be modified, 
it can be optimized.

This may cause performance problems for programs which use only
single-byte characters. The variable @B[1*read-default-string-type*0] is
provided for these programs.  When its value is @B[1internal-thin-string0], the system 
expects single-byte characters only. so the system will return input data
in the form of @B[1internal-thin-string0]. Though it is possible that the system may 
choose to ignore this variable.
@newpage
@Heading[4 Proposed changes to CLtL to support multiple character sets.]

In this section, we list proposed modifications to CLtL.  Chapters 13,
14 and 18 of CLtL are concerned very greatly with multi-byte character, so we specify
modifications to these chapters by making a list of all constants,
functions and variables.  For other chapters we specify only additional
and modifying parts.  Those portions which are not mentioned are
unchanged.

  @b(2  Data Types)

  @b(2.5.2 Strings)
@begin(equation)
   "a string is a specialized vector .... type string-char"
				
   "a string is a specialized vector .... type string-char or @B[internal-thin-char]"
@end(equation)
  @b(2.15 Overlap,Inclusion and Disjointness of Types)

   a description of type string-char is changed to :

      Type standard-char is a subtype of @B[internal-thin-char].
      @B[internal-thin-char] is a subtype of string-char.  string-char is a
      subtype of character.

   and add the following :
    
      Type @B[internal-thin-string] is a subtype of vector because @B[internal-thin-string] means 
      (vector internal-thin-char).

   a description of type string is changed to :

      Type string is a subtype of vector because string means (or
      (vector string-char) internal-thin-string).  Type (vector
      string-char) and @B[internal-thin-string] are disjoint or equality.

   a description of type simple-vector,simple-string ... is changed to :
  
      Type simple-vector,simple-string and simple-bit-vector are disjoint subtype of
      simple-array because each one means (simple-array t (*)),
      (or (simple-array string-char (*)),(or (simple-array internal-thin-char (*)) and
      (simple-array bit (*)).

   and add following :

      Type simple-internal-thin-string means (simple-array
      internal-thin-char (*)) and is a subtype of @B[internal-thin-string]. 

      Type (simple-array string-char (*)) and simple-internal-thin-string are disjoint or
      equality.


  @b(4  Type Specifiers)

  @b(4.1 Type Specifier Symbols)

   add followings to system defined type specifiers :

      simple-internal-thin-string
      internal-thin-string
      internal-thin-char


  @b(4.5 Type Specifiers That Specialize)

   "The specialized types (vector string-char) ... data types."
					
   "The specialized types (vector internal-thin-char), (vector
   string-char) and (vector bit) are so useful that they have the
   special names string and bit-vector.  Every implementation of Common
   Lisp must provide distinct representation for string and bit-vector
   as distinct specialized data types."

@begin(equation)
  @b(13  Characters)

  @b(13.1 Character Attributes)


    char-code-limit@>[constant]
    char-font-limit@>[constant]
    char-bits-limit@>[constant]

  @b(13.2 Predicates on Characters)

   standard-char-p char@>[constant]

   graphic-char-p char@>[constant]
@begin(quotation)
      a description "graphic characters of font 0 are all of the same width when printed" in
      the CLtL changed to "standard-char without #\Newline of font 0 are all of the same
      width when printed".
@end(quotation)

   string-char-p char @>[function]

   internal-thin-char-p char@>[function]
@begin(quotation)
      this function must be added.  
      the argument char must be a character object.  internal-thin-char-p
      is true if char can be stored into a internal-thin-string, and
      otherwise is false.
@end(quotation)

   alpha-char-p char@>[function]

   upper-case-p char@>[function]
   lower-case-p char@>[function]
   both-case-p char@>[function]
      "If a character is either ... alphabetic."
		
      "If a character is either uppercase or lowercase, it is necessarily character
      that alpha-char-p returns true."

   digit-char-p char &optional (radix 10)@>[function]

   alphanumericp char@>[function]

   char= character &rest more-characters@>[function]
   char/= character &rest more-characters@>[function]
   char< character &rest more-characters@>[function]
   char> character &rest more-characters@>[function]
   char<= character &rest more-characters@>[function]
   char>= character &rest more-characters@>[function]

   char-equal character &rest more-characters@>[function]
   char-not-equal character &rest more-characters@>[function]
   char-lessp character &rest more-characters@>[function]
   char-greaterp character &rest more-characters@>[function]
   char-not-greaterp character &rest more-characters@>[function]
   char-not-lessp character &rest more-characters@>[function]

  @b(13.3 Character Construction and Selection)

   char-code char@>[function]
   
   char-bits char@>[function]
   
   char-font char@>[function]
   
   code-char char &optional (bits 0) (font 0)@>[function]

   make-char char &optional (bits 0) (font 0)@>[function]

  @b(13.4 Character Conversion)
   character char@>[function]
   
   char-upcase char@>[function]
   char-downcase char@>[function]
   
   digit-char weight &optional (radix 10) (font 0)@>[function]

   char-int char@>[function]

   int-char char@>[function]

   char-name char@>[function]

   name-char char@>[function]
   
  @b(13.5 Character control-bit functions)

   char-control-bit@>[constant]
   char-meta-bit@>[constant]
   char-super-bit@>[constant]
   char-hyper-bit@>[constant]

   char-bit char name@>[function]

   set-char-bit char name newvalue@>[function]

  @b(14 Sequence)

  @b(14.1 Simple sequence functions)
   elt sequence index@>[Function]

   subseq sequence start &optional end@>[Function]

   copy-seq sequence@>[Function]

   length sequence@>[Function]

   reverse sequence@>[Function]

   nreverse sequence@>[Function]

   make-sequence type size &key :initial-element@>[Function]

  @b(14.2 Sequence connection)
   concatenate result-type &rest sequences@>[Function]

   map result-type function sequence &rest more-sequences@>[Function]

   some predicate sequence &rest more-sequences@>[Function]
   every predicate sequence &rest more-sequences@>[Function]
   notany predicate sequence &rest more-sequences@>[Function]
   notevery predicate sequence &rest more-sequences@>[Function]

   reduce function sequence@>[Function]
			&key :from-end :start :end :initial-value

  @b(14.3 Sequence correction)
   fill sequence item &key :start :end@>[Function]

   replace sequence1 sequence2 &key :start1 :end1 :start2 :end2@>[Function]

   remove item sequence@>[Function]
			&key :from-end :test :test-not 
			     :start :end :count :key
   remove-if test sequence@>[Function]
			&key :from-end :start
			     :end :count :key
   remove-if-not test sequence@>[Function]
			&key :from-end :start
			     :end :count :key


   delete item sequence@>[Function]
			&key :from-end :test :test-not 
			     :start :end :count :key
   remove-if test sequence@>[Function]
			&key :from-end :start
			     :end :count :key
   remove-if-not test sequence@>[Function]
			&key :from-end :start
			     :end :count :key

   remove-duplicates sequence@>[Function]
			&key :from-end :test :test-not 
			     :start :end :key
   delete-duplicates sequence@>[Function]
			&key :from-end :test :test-not 
			     :start :end :key

   subsutitute newitem test sequence@>[Function]
			&key :from-end :test :test-not 
			     :start :end :count :key
   subsutitute-if newitem test sequence@>[Function]
			&key :from-end :start :end :count :key
   subsutitute-if-not newitem test sequence@>[Function]
			&key :from-end :start :end :count :key

   nsubsutitute newitem test sequence@>[Function]
			&key :from-end :test :test-not 
			     :start :end :count :key
   nsubsutitute-if newitem test sequence@>[Function]
			&key :from-end :start :end :count :key
   nsubsutitute-if-not newitem test sequence@>[Function]
			&key :from-end :start :end :count :key

  @b(14.4 Search)
   find item sequence @>[Function]
			&key :from-end :test :test-not
			     :start :end :key
   find-if test sequence @>[Function]
			&key :from-end :start :end :key
   find-if-not test sequence>[Function]
			&key :from-end :start :end :key

   position item sequence@>[Function]
			&key :from-end :test :test-not
			     :start :end :key
   position-if test sequence@>[Function]
			&key :from-end :start :end :key
   position-if-not test sequence@>[Function]
			&key :from-end :start :end :key

   count item sequence@>[Function]
			&key :from-end :test :test-not
			     :start :end :key
   count-if item sequence@>[Function]
			&key :from-end :start :end :key
   count-if-not item sequence@>[Function]
			&key :from-end :start :end :key

   mismatch sequence1 sequence2@>[Function]
			&key :from-end :test :test-not
			     :key :start1 :start2 
			     :end1 :end2

   search sequence1 sequence2@>[Function]
			&key :from-end :test :test-not
			     :key :start1 :start2
			     :end1 :end2



  @b(14.5 Sort,merge)

   sort sequence predicate &key :key@>[Function]

   stable-sort sequence predicate &key :key@>[Function]

   merge result-type sequence1 sequence2 predicate &key :key@>[Function]


  @b(18 Strings)

   "the type string is identical ... (array string-char (*))."
				
   "the type string is identical to the type
    (or (vector internal-thin-char) (vector string-char)), 
    which in turn is the same as (or (array internal-thin-char (*))
    (array string-char (*)))."

  @b(18.3 String Construction and Manipulation)

   make-string size &key :initial-element@>[function]

@begin(quotation)
      add following :

      To make an internal-thin-string, you should use make-array or make-sequence.
@end(quotation)

   
  @b(22  Input/Output)

  @b(22.2 Input Functions)

  @b(22.2.1 Output to Character Stream)

   add following :
  
      *read-default-string-type*@>[variables]
@begin(quotation)
        The value is string or internal-thin-string.  This determines string that the function 
        read takes whether type string or internal-thin-string.
@end(quotation)

  @b(22.3 Output Functions)

  @b(22.3.1 Output from Character Stream)
@begin(quotation)
   add following :
@end(quotation)

      write-width object@>[function]
			&key :unit-type :stream :escape :radix :base
			     :circle :pretty :label :length :case :gensym :array
@begin(quotation)
        This function returns the printed width as the value of the unit
        specified by :unit-type when then printed representation of
        object is written to the output stream specified by :stream.  It
        returns nil when object includes control characters
        (#\Newline,#\Tab etc).  The default of :unit-type is byte.  The
        value which we can specify :unit-type depends on implementation.
@end(quotation)
@end(equation)
@newpage
@Heading[Appendix Proposed Japanese character processing facilities for Common Lisp.]

In addition to the modification of CLtL, here are some suggestions for systems 
including Japanese characters.

 1). How should system behave for Japanese characters both
under unmodified part of CLtL and the part changed for multi-byte
processing.

 2). About function that are specific to Japanese and no at all related
to multi-byte processing.

 Notes: All Japanese characters are constituent. JIS is a abreviation of Japanese Industry 
Standard.
@begin(equation)
   @b(13. Characters)

   @b(13.1. Character Attributes)

    char-code-limit char @>[Function]
@begin(quotation)
	The value of char-code-limit should be large enough to include Japanese characters,
	e.g. 65536.
@end(quotation)

   @b(13.2. Predicates on Characters)

    standard-char-p char @>[Function]
@begin(quotation)
	Return nil for all Japanese characters.
@end(quotation)
	
    graphic-char-p char @>[Function]
@begin(quotation)
	Return t for Japanese characters.
@end(quotation)

   internal-thin-char-p char @>[Function]
@begin(quotation)
	The result depends on each implementation that whether the Japanese character is in
	internal-thin-string or not.
@end(quotation)

    alpha-char-p char @>[Function]
@begin(quotation)
	Return nil for all character except alphabets in Japanese character.  It depends on
        each implementation whether to return t or nil for alphabets in Japanese characters.
@end(quotation)

@newpage

    jis-char-p char@>[Function]
@begin(quotation)
	The argument char has to be a character type object.  jis-char-p is true if the 
argument is included in JIS C-6226, and otherwise false.
@end(quotation)

    japanese-char-p char@>[Function]
@begin(quotation)
	The argument char has to be a character type object.  japanese-char-p is true if the
	argument is a Japanese character and is otherwise false.  All characters that satisfy
        jis-char-p must satisfy japanese-char-p; other characters might.
@end(quotation)
	
    kanji-char-p char@>[Function]
@begin(quotation)
	The argument char has to be character type object.  kanji-char-p is true if the
        argument is one of the 6353 Kanji characters in JIS C6226(3.1.8), the repeat symbol,
	the kanji numeric zero or the same as above symbol for a total of 6356 characters
        that also satisfy jis-char-p. 
@end(quotation)

    hiragana-char-p char@>[Function]
@begin(quotation)
	The argument char has to be character type object.
	hiragana-char-p is true if the argument is one of the 83
	hiragana characters in JIS C6226(3.1.4), the hiragana repeat
	symbol, or dakuten for a total of 85 characters that also
	satisfy jis-char-p.
@end(quotation)

    katakana-char-p char@>[Function]
@begin(quotation)
	The argument char has to be a character type object.
	katakana-char-p is true if the argument is one of the 86
	hiragana characters in JIS C6226(3.1.5), long-sound-symbol,
	katakana-repeat symbol, or katakana-dakuten for a total of 89
	characters that also satisfy jis-char-p.
@end(quotation)

    kana-char-p char@>[Function]
@begin(quotation)
	equivalence (or (hiragana-char-p char) (katakana-char-p char))
@end(quotation)

    upper-case-p char@>[Function]
    lower-case-p char@>[Function]
    both-case-p char@>[Function]
@begin(quotation)
	These are nil if the argument does not satisfy alpha-char-p.
	Japanese characters which satisfy alpha-char-p should be treated
	as normal alphabetic characters.
@end(quotation)
@newpage
    digit-char-p char &optional (radix 10)@>[Function]
@begin(quotation)
	digit-char-p is nil if the argument is a Japanese character.
@end(quotation)

    alphanumericp char@>[Function]
@begin(quotation)
	equivalence (or (alpha-char-p char) (not (null (digit-char-p char))))
@end(quotation)

    char= character &rest more-characters@>[Function]
    char/= character &rest more-characters@>[Function]
    char< character &rest more-characters@>[Function]
    char> character &rest more-characters@>[Function]
    char<= character &rest more-characters@>[Function]
    char>= character &rest more-characters@>[Function]
@begin(quotation)
	The ordering of hiragana, katakana, kanji follows the JIS ordering.
@end(quotation)

   @b(13.4 character Conversions)

    char-upcase char@>[Function]
    char-downcast char@>[Function]
@begin(quotation)
	These return the argument if the argument does not satisfy
	alpha-char-p.  It depends on the implementation whether these
	work on the alphabets included in JIS or not. But it should be
	consistent with upper-case-p, lower-case-p, both-case-p.
@end(quotation)
@end(equation)
Prev by Date: Introduction to a multi-byte character extension
Next by Date: Format
Previous by thread: Introduction to a multi-byte character extension
Next by thread: A multi-byte character extension proposal
Index(es):
- Date
- Thread