#1943 Tcl_UniChar is only 16 bits

obsolete: 8.4.5
open
1
2003-11-13
2002-07-06
No

Tcl_UniChar is defined as 'unsigned short' which
restricts the range of Unicode codepoints representable
in that type to the basic multilingual plane.
Additionally, TCL_UTF_MAX is defined as 3 which
restricts Utf-8 encoded strings to the same range.

Redefining Tcl_UniChar as 'unsigned int' and
TCL_UTF_MAX as 6 fixes the problem but leaves the
library incompatible with existing compiled objects.

If binary compatibility is a concern, perhaps the
internal datatype can be changed and new APIs created
for the wider set of values leaving the existing
typedef in place along with the existing interfaces.
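
As an illustration of the limit described in the report, here is a minimal sketch in C. The helper names are hypothetical and are not part of the Tcl API. It assumes the 4-byte UTF-8 limit for U+0000..U+10FFFF that later revisions of UTF-8 settled on, rather than the 6-byte form that TCL_UTF_MAX = 6 refers to.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a Tcl API): bytes needed to encode a
 * codepoint in UTF-8, using the 4-byte limit for U+0000..U+10FFFF. */
static int utf8_length(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;   /* the whole BMP fits in 3 bytes */
    return 4;                     /* supplementary planes need 4 */
}

/* Nonzero if the codepoint is unchanged after being stored in a
 * 16-bit unit, as it would be in a 16-bit Tcl_UniChar. */
static int fits_in_16_bits(uint32_t cp)
{
    uint16_t narrow = (uint16_t)cp;  /* high bits are discarded here */
    return narrow == cp;
}
```

Any codepoint at or above U+10000 fails the second check, and needs a fourth UTF-8 byte, which is why both limits named in the report cut off at the same place.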

Discussion

  • David Gravereaux

    Logged In: YES
    user_id=7549

    This is interesting! Please teach me.

    >Tcl_UniChar is defined as 'unsigned short' which
    >restricts the range of Unicode codepoints representable
    >in that type to the basic multilingual plane.

    In what way does a unicode-16 glyph not fit in 16 bits? Has
    the standard changed? Is there a new design standard we
    can refer to? Has the maximum number of bytes that a
    utf-8 encoded glyph can occupy changed as well? Is
    Tcl now not standards compliant with the unicode spec? Which
    spec are we on now, and which spec should we adhere to?

     
  • Donal K. Fellows

    Logged In: YES
    user_id=79902

    IIRC, recent full Unicode standards mandate 32-bit
    characters so that virtually every character ever invented
    by mankind can be represented (though you can bet that
    someone somewhere has a plan to go to 64-bit chars just to
    make sure that extra-terrestrial languages can be
    incorporated into Unicode too!) Alas, doing this would blow
    the core up even larger than it is now, because there are a
    lot of places where we need fixed-width strings for
    performance reasons (e.g. [string index] on UTF-8 is an O(n)
    operation.)

    I somehow doubt that the default config of Tcl is going to
    change from 16-bit chars in the near future; the pay-off is
    just not good enough at the moment. However, I don't know
    if there's anything much in the core right now (aside from
    the declarations in tcl.h) that assumes a particular
    character size; removing such assumptions is fine to make
    for easier porting to 32-bit chars if/when it becomes a real
    issue.

     
  • Donal K. Fellows

    • priority: 5 --> 1
     
  • David Gravereaux

    Logged In: YES
    user_id=7549

    Right in section 2.2 of the 3.0 spec it states, "Plain unicode
    text consists of sequences of 16-bit unicode character
    codes." And it defines the use of paired codes (none are
    defined) as an extension, which might be what Keith is
    refering to. But it states quite clearly that a unicode
    character _is_ 16-bits (paired or ain't).

     
  • Keith Packard

    Keith Packard - 2002-07-09

    Logged In: YES
    user_id=50020

    (section 2.2 of Unicode 3.0)

    "From the full range of 65536 code values, 63486 are
    available to represent characters with 16-bit code values
    and 2048 code values are available to represent an
    additional 1048544 characters through paired 16-bit code
    values. These paired code values, or surrogates, will allow
    implementations access to additional characters in the
    future. None of these surrogate pairs has been assigned
    in this version of the standard."

    In versions 3.1 and 3.2, there have been significant
    additions to the Unicode standard, including many thousands
    of glyphs outside of what is now called the "basic
    multilingual plane" (BMP). Several new encodings have been
    defined, and the basic 16-bit encoding (UCS2) no longer provides
    the 1-value 1-codepoint mapping which Tcl assumes.

    The encoding standard hasn't changed; 3.0 explicitly allows
    for codepoints outside the BMP. What's changed is that there
    are now many defined codepoints in this extended range.

    If the intent is to have Tcl support Unicode, then there is
    no question that these additional codepoints must be
    representable; the only question is how to do so. I
    recommend abandoning the 16-bit encoding as useless: it
    doesn't provide the 1-to-1 mapping which makes constant-time
    indexing trivial, and with surrogates, it requires more
    bytes on average to represent the same data than UTF-8. You
    may still want to provide the existing 16-bit APIs for some
    level of binary compatibility; those APIs can use surrogates
    to ensure all Unicode codepoints are representable. APIs
    involving a single codepoint should transition to using a
    new 32-bit datatype.

    Constant-time indexing can be implemented using indexes into
    the string. One trivial design would have a 32-bit index for
    each 256 codepoints. That means locating a codepoint by
    index takes a table lookup to find the offset (constant
    time) and a search through that portion of the data (time
    bounded by a constant).

    UTF-8 now specifies that up to 6 bytes may be required to
    hold a value. Raising only that limit provides the minimum
    required functionality to use non-BMP glyphs in Tcl; I've
    implemented a modern X backend that can access non-BMP
    glyphs as long as the strings are never represented as
    arrays of Tcl_UniChars inside the interpreter. The only
    difficulty there was that the internal UTF8 to UCS converter
    is defined to return a Tcl_UniChar and so I had to use a
    separate complete implementation.
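
The checkpoint index suggested in the comment above (one 32-bit offset for each 256 codepoints) could be sketched roughly as follows. This is only an illustration under stated assumptions: the struct and function names are hypothetical, the input is assumed to be valid UTF-8 shorter than 4 GB, and nothing here reflects how Tcl stores strings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define STRIDE 256  /* one checkpoint per 256 codepoints */

typedef struct {
    const char *bytes;    /* UTF-8 data, not owned by the index */
    size_t      nchars;   /* number of codepoints in the data */
    uint32_t   *offsets;  /* offsets[k] = byte offset of codepoint k*STRIDE */
    size_t      noffsets; /* number of checkpoints stored */
} IndexedUtf8;

/* Every byte except a continuation byte (10xxxxxx) starts a codepoint. */
static int is_lead_byte(unsigned char b)
{
    return (b & 0xC0) != 0x80;
}

/* Build the checkpoint table in one pass; returns 0 on success. */
static int indexed_init(IndexedUtf8 *s, const char *bytes, size_t nbytes)
{
    size_t i;

    /* Worst case is one codepoint per byte, so this many slots suffice. */
    s->offsets = malloc((nbytes / STRIDE + 1) * sizeof *s->offsets);
    if (s->offsets == NULL) {
        return -1;
    }
    s->bytes = bytes;
    s->nchars = 0;
    s->noffsets = 0;
    for (i = 0; i < nbytes; i++) {
        if (is_lead_byte((unsigned char)bytes[i])) {
            if (s->nchars % STRIDE == 0) {
                s->offsets[s->noffsets++] = (uint32_t)i;
            }
            s->nchars++;
        }
    }
    return 0;
}

/* Byte offset of codepoint 'index': one table lookup, then a forward
 * scan over at most STRIDE - 1 codepoints. */
static size_t indexed_offset(const IndexedUtf8 *s, size_t index)
{
    size_t pos, remaining;

    assert(index < s->nchars);
    pos = s->offsets[index / STRIDE];
    for (remaining = index % STRIDE; remaining > 0; remaining--) {
        pos++;  /* step past the current lead byte */
        while (!is_lead_byte((unsigned char)s->bytes[pos])) {
            pos++;  /* skip its continuation bytes */
        }
    }
    return pos;
}
```

The scan after the lookup never covers more than 255 codepoints, so the cost of a lookup is bounded by a constant regardless of string length. The table adds 4 bytes per 256 codepoints.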

     
  • Jeffrey Hobbs

    Jeffrey Hobbs - 2002-07-09

    Logged In: YES
    user_id=72656

    I've spoken with Markus Kuhn about this at length when
    moving to the 3.0 spec (3.[12] were out, but I decided to not
    go there for now). Tcl will be staying by default at the 16-bit
    Tcl_UniChar value with TCL_UTF_MAX == 3. These are
    tweakable parameters during compile, and one can specify
    32-bit Tcl_UniChars and TCL_UTF_MAX == 6. There is code
    to support it.

    At one point I tested the code to see that it ran through the
    test suite, which it did with a few minor problems (mostly in
    test suite assumptions). However, the 8.x line of Tcl needs
    to stay with 16-bit UCS2 for binary compatibility reasons.

    Should we move to UCS4 in Tcl 9? That is an outstanding
    question. Java and Windows both use UCS2 only. X is
    moving towards full UCS4, but very few apps use it and I'm
    not even sure of the extent to which it will be used 5 or 10 years
    from now.

    I think the best solution would be to introduce UCS2/UCS4
    conversion functions in the Tcl API. The reason for this is
    that, as Keith notes, they are useful now, and will definitely
    be useful going forward as we are bound to need to deal with
    UCS2 systems for a long time to come.

     
  • Jeffrey Hobbs

    Jeffrey Hobbs - 2002-07-09
    • assigned_to: nijtmans --> hobbs
     
  • Keith Packard

    Keith Packard - 2002-07-09

    Logged In: YES
    user_id=50020

    To say that Java and Windows support UCS2 only misstates
    their capabilities. Both Java and Windows have support for
    surrogates. That's partially because surrogates are used
    extensively in supporting Chinese, especially the additions
    to Unicode 3.2 from the Hong Kong Supplementary Character
    Set. There is no reason to move the 16-bit APIs to 32-bit
    values; UCS-2 is a perfectly valid Unicode encoding. It just
    must allow for the presence of surrogates and respond
    accordingly.

    There are a few Tcl APIs which cannot pass surrogate pairs,
    one of those is the UTF8 to Tcl_UniChar API.

    The problem is that UTF-8 doesn't use surrogates, and so
    applications using UTF-8 expect the full range of Unicode
    values to be encoded without them. Converting from UTF-8 to
    UCS-2 should automatically insert surrogates while
    converting back should replace them with the conjoined value.

    Places in the API which deal with a single codepoint in
    relation to its UTF-8 encoding should provide an additional
    32-bit interface.
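
The surrogate insertion and removal described in the comment above comes down to a small amount of arithmetic. The following is a minimal sketch with hypothetical function names, not an existing Tcl interface. It covers only the codepoint-to-pair step, not the UTF-8 decoding around it.

```c
#include <assert.h>
#include <stdint.h>

/* Split a codepoint in U+10000..U+10FFFF into a surrogate pair. */
static void to_surrogates(uint32_t cp, uint16_t *high, uint16_t *low)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    cp -= 0x10000;                              /* 20 bits remain */
    *high = (uint16_t)(0xD800 + (cp >> 10));    /* top 10 bits */
    *low  = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* bottom 10 bits */
}

/* Join a high and low surrogate back into the codepoint they encode. */
static uint32_t from_surrogates(uint16_t high, uint16_t low)
{
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low >= 0xDC00 && low <= 0xDFFF);
    return 0x10000
        + (((uint32_t)(high - 0xD800) << 10) | (uint32_t)(low - 0xDC00));
}
```

For example, U+1D11E splits into the pair 0xD834, 0xDD1E, and joining that pair gives back U+1D11E. A converter from UTF-8 to 16-bit units would call the first function for every decoded value above U+FFFF, and the reverse converter would call the second whenever it meets a high surrogate followed by a low one.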

     
  • Jeffrey Hobbs

    Jeffrey Hobbs - 2003-07-16

    Logged In: YES
    user_id=72656

    Tcl allows -DTCL_UTF_MAX=6 in CFLAGS for 8.5a and 8.4.4,
    which switches Tcl_UniChar to unsigned int. This isn't a final
    solution, but it is half-way there for those who really want it.
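
The compile-time switch described in the comment above can be pictured as a conditional typedef. This is a sketch of the mechanism as the comment states it, not a copy of tcl.h, and the actual header may differ in detail. The check function is hypothetical and is included only to show what each configuration guarantees.

```c
#include <assert.h>
#include <limits.h>

/* Default to the 16-bit configuration unless the build overrides it,
 * for example with -DTCL_UTF_MAX=6 in CFLAGS. */
#ifndef TCL_UTF_MAX
#define TCL_UTF_MAX 3
#endif

#if TCL_UTF_MAX > 3
typedef unsigned int Tcl_UniChar;    /* wide enough for non-BMP values */
#else
typedef unsigned short Tcl_UniChar;  /* BMP only */
#endif

/* Hypothetical sanity check: the wide build must hold U+10FFFF
 * (21 bits); the default build only needs 16 bits. */
static void check_unichar_width(void)
{
#if TCL_UTF_MAX > 3
    assert(sizeof(Tcl_UniChar) * CHAR_BIT >= 21);
#else
    assert(sizeof(Tcl_UniChar) * CHAR_BIT >= 16);
#endif
}
```

Because the typedef changes size between the two builds, extensions compiled against one configuration cannot safely be loaded into the other, which is the binary compatibility concern raised earlier in the thread.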

     
  • Don Porter

    Don Porter - 2003-11-13
    • milestone: --> obsolete: 8.4.5