Tcl_UniChar is defined as 'unsigned short', which
restricts the range of Unicode codepoints representable
in that type to the Basic Multilingual Plane.
Additionally, TCL_UTF_MAX is defined as 3, which
restricts UTF-8 encoded strings to the same range.
Redefining Tcl_UniChar as 'unsigned int' and
TCL_UTF_MAX as 6 fixes the problem but leaves the
library incompatible with existing compiled objects.
If binary compatibility is a concern, perhaps the
internal datatype can be changed and new APIs created
for the wider set of values leaving the existing
typedef in place along with the existing interfaces.
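The parallel-API idea above could look roughly like the following; this is a minimal C sketch, and Tcl_UniChar32 and WidenBmp are hypothetical names for illustration, not actual Tcl API:

```c
#include <assert.h>

/* The existing 16-bit typedef stays for binary compatibility;
 * a wider type (hypothetical name, not actual Tcl API) is added
 * alongside it for new interfaces. */
typedef unsigned short Tcl_UniChar;   /* as in tcl.h today */
typedef unsigned int   Tcl_UniChar32; /* hypothetical wider type */

/* A BMP value carries over to the wider type unchanged, so
 * existing callers could migrate incrementally. */
static Tcl_UniChar32 WidenBmp(Tcl_UniChar ch) {
    return (Tcl_UniChar32) ch;
}
```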
Logged In: YES
user_id=7549
This is interesting! Please teach me.
>Tcl_UniChar is defined as 'unsigned short' which
>restricts the range of Unicode codepoints representable
>in that type to the basic multilingual plane.
In what way does a Unicode-16 glyph not fit in 16 bits? Has
the standard changed? Is there a new design standard we
can refer to? Has the maximum number of bytes to which a
UTF-8 encoded glyph can be converted changed as well? Is
Tcl now not standards compliant with the Unicode spec?
Which spec are we at now, and what spec should we adhere to?
Logged In: YES
user_id=79902
IIRC, recent full Unicode standards mandate 32-bit
characters so that virtually every character ever invented
by mankind can be represented (though you can bet that
someone somewhere has a plan to go to 64-bit chars just to
make sure that extra-terrestrial languages can be
incorporated into Unicode too!) Alas, doing this would blow
the core up even larger than it is now, because there are a
lot of places where we need fixed-width strings for
performance reasons (e.g. [string index] on UTF-8 is an O(n)
operation.)
I somehow doubt that the default config of Tcl is going to
change from 16-bit chars in the near future; the pay-off is
just not good enough at the moment. However, I don't know
if there's anything much in the core right now (aside from
the declarations in tcl.h) that assumes a particular
character size; removing such assumptions would make for
easier porting to 32-bit chars if/when it becomes a real
issue.
Logged In: YES
user_id=7549
Right in section 2.2 of the 3.0 spec it states, "Plain Unicode
text consists of sequences of 16-bit Unicode character
codes." And it defines the use of paired codes (none are
defined) as an extension, which might be what Keith is
referring to. But it states quite clearly that a Unicode
character _is_ 16 bits (paired or ain't).
Logged In: YES
user_id=50020
(section 2.2 of Unicode 3.0)
"From the full range of 65536 code values, 63486 are
available to represent characters with 16-bit code values
and 2048 code values are available to represent an
additional 1048544 characters through paired 16-bit code
values. These paired code values, or surrogates, will allow
implementations access to additional characters in the
future. <em>None of these surrogate pairs has been assigned
in this version of the standard.</em>"
In versions 3.1 and 3.2, there have been significant
additions to the Unicode standard, including many thousands
of glyphs outside what is now called the "Basic
Multilingual Plane" (BMP). Several new encodings have been
defined; the basic 16-bit encoding (UCS-2) no longer provides
the one-value, one-codepoint mapping that Tcl assumes.
The encoding standard hasn't changed: 3.0 explicitly allows
for codepoints outside the BMP. What's changed is that there
are now many defined codepoints in this extended range.
If the intent is to have Tcl support Unicode, then there is
no question that these additional codepoints must be
representable; the only question is how to do so. I
recommend abandoning the 16-bit encoding as useless: it
doesn't provide the 1-to-1 mapping that makes constant-time
indexing trivial, and with surrogates it requires more
bytes on average than UTF-8 to represent the same data. You
may still want to provide the existing 16-bit APIs for some
level of binary compatibility; those APIs can use surrogates
to ensure all Unicode codepoints are representable. APIs
involving a single codepoint should transition to a
new 32-bit datatype.
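The surrogate arithmetic involved is mechanical; here is a hedged C sketch of the split and rejoin steps (the helper names are illustrative, not Tcl API):

```c
#include <assert.h>

/* Split a supplementary codepoint (>= 0x10000) into a UTF-16
 * surrogate pair. Returns 1 if a pair was produced, 0 if the
 * value fits in a single 16-bit unit. Illustrative names. */
static int ToSurrogates(unsigned int cp,
                        unsigned short *hi, unsigned short *lo) {
    if (cp < 0x10000) {
        *hi = (unsigned short) cp;
        return 0;
    }
    cp -= 0x10000;                          /* 20-bit offset */
    *hi = (unsigned short) (0xD800 + (cp >> 10));   /* high surrogate */
    *lo = (unsigned short) (0xDC00 + (cp & 0x3FF)); /* low surrogate */
    return 1;
}

/* Rejoin a surrogate pair into the original codepoint. */
static unsigned int FromSurrogates(unsigned short hi, unsigned short lo) {
    return 0x10000 + (((unsigned int)(hi - 0xD800)) << 10)
                   + (unsigned int)(lo - 0xDC00);
}
```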
Constant-time indexing can be implemented using indexes into
the string; one trivial design would keep a 32-bit index for
every 256 codepoints. That means locating a codepoint by
index takes a table lookup to find the offset (constant
time) plus a search through that portion of the data (time
bounded by a constant).
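As a rough illustration of that design, the table can record the byte offset of every 256th codepoint in the UTF-8 data; lookup is then one table access plus at most 256 forward steps. This is a sketch under the assumption of well-formed input; the names and the CHUNK granularity are illustrative:

```c
#include <assert.h>
#include <stdlib.h>

#define CHUNK 256  /* codepoints per index entry, as in the text */

/* Byte length of a UTF-8 sequence from its lead byte (pre-2003
 * UTF-8, up to 6 bytes). Assumes well-formed input. */
static int Utf8Len(unsigned char lead) {
    if (lead < 0x80) return 1;
    if (lead < 0xE0) return 2;
    if (lead < 0xF0) return 3;
    if (lead < 0xF8) return 4;
    if (lead < 0xFC) return 5;
    return 6;
}

/* Build the table: tab[i] is the byte offset of codepoint i*CHUNK.
 * Caller frees; *ntab receives the entry count. Sketch only:
 * error handling omitted. */
static size_t *BuildIndex(const char *s, size_t nbytes, size_t *ntab) {
    size_t cap = nbytes / CHUNK + 2, n = 0, off = 0, cp = 0;
    size_t *tab = malloc(cap * sizeof *tab);
    while (off < nbytes) {
        if (cp % CHUNK == 0) tab[n++] = off;
        off += Utf8Len((unsigned char) s[off]);
        cp++;
    }
    *ntab = n;
    return tab;
}

/* Locate codepoint idx: one lookup plus at most CHUNK steps. */
static size_t ByteOffsetOf(const char *s, const size_t *tab, size_t idx) {
    size_t off = tab[idx / CHUNK], cp = idx % CHUNK;
    while (cp--) off += Utf8Len((unsigned char) s[off]);
    return off;
}
```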
UTF-8 now specifies that up to 6 bytes may be required to
hold a value. Raising only that limit provides the minimum
functionality required to use non-BMP glyphs in Tcl; I've
implemented a modern X backend that can access non-BMP
glyphs as long as the strings are never represented as
arrays of Tcl_UniChars inside the interpreter. The only
difficulty there was that the internal UTF-8 to UCS converter
is defined to return a Tcl_UniChar, so I had to use a
separate, complete implementation.
Logged In: YES
user_id=72656
I've spoken with Markus Kuhn about this at length when
moving to the 3.0 spec (3.[12] were out, but I decided to not
go there for now). Tcl will stay, by default, at the 16-bit
Tcl_UniChar value with TCL_UTF_MAX == 3. These are
tweakable parameters at compile time, and one can specify
32-bit Tcl_UniChars and TCL_UTF_MAX == 6. There is code
to support it.
At one point I tested the code to see that it ran through the
test suite, which it did with a few minor problems (mostly in
test suite assumptions). However, the 8.x line of Tcl needs
to stay with 16-bit UCS-2 for binary compatibility reasons.
Should we move to UCS-4 in Tcl 9? That is an outstanding
question. Java and Windows both use UCS-2 only, and X is
moving towards full UCS-4, but very few apps use it and I'm
not even sure to what extent it will be used 5 or 10 years
from now.
I think the best solution would be to introduce UCS2/UCS4
conversion functions in the Tcl API. The reason for this is
that, as Keith notes, they are useful now, and will definitely
be useful going forward as we are bound to need to deal with
UCS2 systems for a long time to come.
Logged In: YES
user_id=50020
To say that Java and Windows support UCS-2 only misstates
their capabilities: both have support for
surrogates. That's partially because surrogates are used
extensively in supporting Chinese, especially the additions
to Unicode 3.2 from the Hong Kong Supplementary Character
Set. There is no reason to move the 16-bit APIs to 32-bit
values; UCS-2 is a perfectly valid Unicode encoding, as long
as implementations allow for the presence of surrogates and
respond accordingly.
There are a few Tcl APIs which cannot pass surrogate pairs;
one of those is the UTF-8 to Tcl_UniChar API.
The problem is that UTF-8 doesn't use surrogates, so
applications using UTF-8 expect the full range of Unicode
values to be encoded without them. Converting from UTF-8 to
UCS-2 should automatically insert surrogates, while
converting back should replace them with the combined value.
Places in the API which deal with a single codepoint in
relation to its UTF-8 encoding should provide an additional
32-bit interface.
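For example, a 4-byte UTF-8 sequence (covering U+10000 through U+10FFFF) can be decoded and re-emitted as a surrogate pair in one step, which is the conversion a UTF-8 to UCS-2 path must perform. A minimal sketch with no validation of continuation bytes; the function name is illustrative, not Tcl API:

```c
#include <assert.h>

/* Decode one 4-byte UTF-8 sequence and emit it as a UTF-16
 * surrogate pair. Assumes well-formed input; no validation. */
static void Utf8ToSurrogatePair(const unsigned char *p,
                                unsigned short *hi, unsigned short *lo) {
    /* Reassemble the 21-bit codepoint from the payload bits. */
    unsigned int cp = ((unsigned int)(p[0] & 0x07) << 18)
                    | ((unsigned int)(p[1] & 0x3F) << 12)
                    | ((unsigned int)(p[2] & 0x3F) << 6)
                    |  (unsigned int)(p[3] & 0x3F);
    cp -= 0x10000;                            /* 20-bit offset */
    *hi = (unsigned short)(0xD800 | (cp >> 10));
    *lo = (unsigned short)(0xDC00 | (cp & 0x3FF));
}
```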
Logged In: YES
user_id=72656
Tcl allows -DTCL_UTF_MAX=6 in CFLAGS for 8.5a and 8.4.4,
which kicks in an unsigned int Tcl_UniChar. This isn't a final
solution, but it's halfway there for those who really want it.
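A hypothetical build invocation using that flag might look like the following; the prefix path is illustrative:

```shell
# Widen Tcl_UniChar at compile time: TCL_UTF_MAX=6 selects the
# unsigned int Tcl_UniChar path described above.
CFLAGS=-DTCL_UTF_MAX=6 ./configure --prefix=/usr/local
make
```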