Xchars and Unicode (Gforth Manual)


6.22.10 Xchars and Unicode

ASCII suffices only for the English language. Most other western languages still fit reasonably well into the Forth frame, since a byte is sufficient to encode the few special characters in each (though they cannot all share the same encoding; Latin-1 is the most widely used). Other languages need different character sets, several of them variable-width. To deal with this problem, characters are often represented as Unicode codepoints on the stack, and as UTF-8 byte strings in memory. A Unicode codepoint often represents one application-level character, but Unicode also supports decomposed characters that consist of several codepoints, e.g., a base letter and a combining diacritical mark.

A Unicode codepoint can consume more than one byte in memory, so we adjust our terminology: A char is a raw byte in memory or a value in the range 0-255 on the stack. An xchar (for extended char) stands for one codepoint; it is represented by one or more bytes in memory and may have larger values on the stack. ASCII characters are the same as chars and as xchars: values in the range 0-127, and a single byte with that value in memory.

When using UTF-8 encoding, all other codepoints take more than one byte/char. In most cases, you can just treat such characters as strings in memory and don’t need to use the following words, but if you want to deal with individual codepoints, the following words are useful. We currently have no words for dealing with decomposed characters.

The xchar words add a few data types:

xc is an extended char (xchar) on the stack. It occupies one cell, and is a subset of unsigned cell. On 16-bit systems, only the BMP subset of the Unicode character set (i.e., codepoints below 65536) can be represented on the stack. If you represent your application characters as strings at all times, you can avoid this limitation.
xc-addr is the address of an xchar in memory. Alignment requirements are the same as c-addr. The memory representation of an xchar differs from the stack representation, and depends on the encoding used. An xchar may use a variable number of chars in memory.
xc-addr u is a buffer of xchars in memory, starting at xc-addr, u chars (i.e., bytes, not xchars) long.

xc-size ( xc – u  ) xchar “x-c-size”

Computes the memory size of the xchar xc in chars.

x-size ( xc-addr u1 – u2  ) xchar “x-size”

Computes the memory size, in chars, of the first xchar stored at xc-addr.
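As a rough illustration of xc-size and x-size, the following lines can be typed at the Gforth prompt. They assume that Gforth is using the UTF-8 encoding, that cells are at least 32 bits wide, and that the source text itself is UTF-8; the sizes in the comments hold only under those assumptions.

```forth
\ Sizes by stack value (assumes UTF-8 encoding, cells of 32 bits or more)
$41    xc-size .     \ 1  U+0041 'A', plain ASCII
$E9    xc-size .     \ 2  U+00E9, e with acute accent
$20AC  xc-size .     \ 3  U+20AC, euro sign
$1F600 xc-size .     \ 4  U+1F600, outside the BMP

\ Size of the first xchar of a string that is already in memory
s" €5" x-size .      \ 3  the euro sign leads this 4-char string
```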

xc@ ( xc-addr – xc  ) xchar-ext “xc-fetch”

Fetches the xchar xc at xc-addr.

xc@+ ( xc-addr1 – xc-addr2 xc  ) xchar “x-c-fetch-plus”

Fetches the xchar xc at xc-addr1. xc-addr2 points to the first memory location after xc.
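To illustrate walking a string with xc@+, here is a minimal sketch of a word, hypothetically named bmp-only?, that checks whether every codepoint of a string lies in the BMP (the condition under which its xchars would fit on the stack of a 16-bit system). It assumes a well-formed string and the UTF-8 encoding.

```forth
\ bmp-only? is a hypothetical helper, not a Gforth word.
: bmp-only? ( xc-addr u -- f )
  over + swap                        \ ( end-addr xc-addr )
  begin 2dup u> while                \ until xc-addr reaches end-addr
    xc@+ $FFFF u> if                 \ fetch one xchar and advance
      2drop false exit               \ found a codepoint beyond the BMP
    then
  repeat 2drop true ;

s" Grüße" bmp-only? .                \ -1 (true)
s\" \xF0\x9F\x98\x80" bmp-only? .    \ 0: these four bytes encode U+1F600
```

Comparing the running address against the end address is safe here only because the string is assumed to be well-formed; xc@+ itself does no bounds checking.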

xc@+? ( xc-addr1 u1 – xc-addr2 u2 xc  ) gforth-experimental “x-c-fetch-plus-query”

Fetches the first xchar xc of the string xc-addr1 u1. xc-addr2 u2 is the remaining string after xc.
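Because xc@+? keeps the remaining string on the stack, a loop over a string needs no separate end address. The following sketch folds the codepoints of a string into their maximum; the name xc-max is made up for this example, and the code assumes a well-formed UTF-8 string and a Gforth version that provides the experimental xc@+?.

```forth
\ xc-max is a hypothetical helper; it returns 0 for an empty string.
: xc-max ( xc-addr u -- xc )
  0 >r                               \ running maximum on the return stack
  begin dup while                    \ while chars remain in the string
    xc@+? r> umax >r                 \ ( xc-addr2 u2 ), fold xc into the maximum
  repeat 2drop r> ;

s" Grüße" xc-max hex.                \ $FC: U+00FC is the largest codepoint here
```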

xc!+? ( xc xc-addr1 u1 – xc-addr2 u2 f  ) xchar “x-c-store-plus-query”

Stores the xchar xc into the buffer starting at address xc-addr1, u1 chars large. xc-addr2 points to the first memory location after xc, u2 is the remaining size of the buffer. If the xchar xc fits into the buffer, f is true; otherwise f is false, and xc-addr2 u2 are equal to xc-addr1 u1. xc!+? is safe against buffer overflows, and therefore preferred over xc!+.
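A small sketch of building a string with xc!+? follows. The buffer xbuf and the words append-xc and price$ are hypothetical names chosen for this example; the lengths in the comments assume the UTF-8 encoding.

```forth
create xbuf 8 allot                  \ hypothetical buffer: room for 8 chars

\ Hypothetical helper: store xc, or abort if it does not fit.
: append-xc ( xc-addr1 u1 xc -- xc-addr2 u2 )
  -rot xc!+? 0= abort" xchar does not fit" ;

: price$ ( -- xc-addr u )
  xbuf 8  '5' append-xc  $20AC append-xc   \ "5", then the euro sign
  drop xbuf tuck - ;                 \ start address and number of chars used

price$ nip .                         \ 4: one char for "5", three for the euro sign
$20AC xbuf 2 xc!+? . . drop          \ 0 2: no room in 2 chars, size unchanged
```

Passing the remaining buffer along on the stack means the overflow check costs no extra bookkeeping in the caller.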

xc!+ ( xc xc-addr1 – xc-addr2  ) xchar “x-c-store-plus”

Stores the xchar xc at xc-addr1. xc-addr2 is the next unused address in the buffer. Note that this writes up to 4 bytes, so you need at least 3 bytes of padding after the end of the buffer to avoid overwriting useful data if you only check the address against the end of the buffer.

xchar+ ( xc-addr1 – xc-addr2  ) xchar “x-char-plus”

Adds the size of the xchar stored at xc-addr1 to this address, giving xc-addr2.

xchar- ( xc-addr1 – xc-addr2  ) xchar-ext “x-char-minus”

Goes backward from xc-addr1 until it finds an xchar so that the size of this xchar added to xc-addr2 gives xc-addr1.
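The two address-stepping words can be sketched in use as follows. Both helper names are hypothetical, and the results in the comments assume the UTF-8 encoding and UTF-8 source text.

```forth
\ Hypothetical helpers built on xchar+ and xchar-.
: second-xc ( xc-addr -- xc )        \ the xchar after the first one
  xchar+ xc@ ;
: last-xc ( xc-addr u -- xc )        \ u must be nonzero
  + xchar- xc@ ;                     \ step back from the end of the string

s" éa" drop second-xc emit           \ a: xchar+ skipped the two chars of é
s" Grüß" last-xc hex.                \ $DF: xchar- found the start of ß
```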

+x/string ( xc-addr1 u1 – xc-addr2 u2  ) xchar-ext “plus-x-slash-string”

Step forward by one xchar in the buffer defined by address xc-addr1, size u1 chars. xc-addr2 is the address and u2 the size in chars of the remaining buffer after stepping over the first xchar in the buffer.
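Stepping with +x/string until the string is exhausted gives a simple way to count xchars rather than chars. The word name xc-count is an assumption of this sketch, as are the UTF-8 encoding and a well-formed input string.

```forth
\ xc-count is a hypothetical helper: number of xchars in a string.
: xc-count ( xc-addr u -- n )
  0 >r                               \ counter on the return stack
  begin dup while                    \ while chars remain
    +x/string r> 1+ >r               \ skip one xchar, count it
  repeat 2drop r> ;

s" Grüße" nip .                      \ 7 chars (bytes) in memory
s" Grüße" xc-count .                 \ 5 xchars (codepoints)
```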

x\string- ( xc-addr u1 – xc-addr u2  ) xchar-ext “x-backslash-string-minus”

Step backward by one xchar in the buffer defined by address xc-addr and size u1 in chars, starting at the end of the buffer. xc-addr is the address and u2 the size in chars of the remaining buffer after stepping backward over the last xchar in the buffer.
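One plausible use of x\string- is a backspace operation in a line editor, which has to remove a whole xchar rather than a single byte. The sketch below uses the hypothetical name xbackspace and assumes the UTF-8 encoding.

```forth
\ xbackspace is a hypothetical helper: drop the last xchar, if there is one.
: xbackspace ( xc-addr u1 -- xc-addr u2 )
  dup if x\string- then ;            \ leave an empty string alone

s" Grüß" xbackspace nip .            \ 4: the two chars of ß are gone
s" Grüß" xbackspace type             \ Grü
```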

-trailing-garbage ( xc-addr u1 – xc-addr u2  ) xchar-ext “minus-trailing-garbage”

Examines the last xchar in the buffer xc-addr u1: if the encoding is correct and it represents a full xchar, u2 equals u1; otherwise, u2 is the length of the string without the last (garbled) xchar.
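A typical case for -trailing-garbage is a fixed-size read that cuts a multi-byte xchar in half. The lines below simulate that by shortening a string by one char; going by the description above, the truncated euro sign should be dropped entirely. The results in the comments assume the UTF-8 encoding and UTF-8 source text.

```forth
s" 5€" 2dup -trailing-garbage nip .  \ 4: the string ends on a complete xchar
1- -trailing-garbage nip .           \ 1: only "5" is left once the euro sign is cut
```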

x-width ( xc-addr u – n  ) xchar-ext “x-width”

n is the number of monospace ASCII chars that take the same space to display as the xchar string starting at xc-addr and using u chars. This assumes a monospaced display font, i.e., the width of every character is an integer multiple of the width of an ASCII char.

xkey ( – xc  ) xchar “x-key”

Reads an xchar from the terminal. This will discard all input events up to the completion of the xchar.

xc-width ( xc – n  ) xchar-ext “x-c-width”

xc has a width of n times the width of a normal fixed-width glyph.
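The difference between memory size and display width matters when aligning columns. The following sketch shows xc-width and x-width on a double-width character and uses them in a padding word; type-padded is a hypothetical name, and the widths in the comments assume the UTF-8 encoding and UTF-8 source text.

```forth
'A'   xc-width .                     \ 1
$4E2D xc-width .                     \ 2  U+4E2D is a double-width CJK ideograph
s" 中文" x-width .                    \ 4  two double-width xchars, 6 chars in memory

\ Hypothetical helper: type a string, then pad with spaces to n columns.
: type-padded ( xc-addr u n -- )
  >r 2dup type x-width               \ display width of what was just typed
  r> swap - 0 max spaces ;           \ fill the rest of the column, if any

s" 中文" 6 type-padded ." |"          \ two spaces of padding precede the bar
```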

xhold ( xc –  ) xchar-ext “x-hold”

Used between <<# and #>. Prepend xc to the pictured numeric output string. Alternatively, use holds.
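As an illustration of xhold, the sketch below formats an unsigned number with a leading euro sign. The word euro$ is a hypothetical name; the example uses the standard <# ... #> bracket and assumes the UTF-8 encoding for the length shown in the comment.

```forth
\ euro$ is a hypothetical helper: u as digits with a euro sign in front.
: euro$ ( u -- c-addr len )
  0 <# #s                            \ convert all digits of the double u 0
  $20AC xhold                        \ prepend U+20AC to the digits
  #> ;

42 euro$ type                        \ €42
42 euro$ nip .                       \ 5: three chars for the sign, two digits
```

Because pictured numeric output builds the string from right to left, xhold has to come after #s to place the sign in front of the digits.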

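xc, ( xchar –  ) xchar “x-c-comma”

Appends the encoding of xchar to the dictionary at here, advancing here by the size of the xchar in chars.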
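A short sketch of xc, in use: the lines below lay down three Greek letters in the dictionary and name the resulting string. The name greek$ is hypothetical, and the length assumes the UTF-8 encoding.

```forth
here  $3B1 xc,  $3B2 xc,  $3B3 xc,   \ lay down U+03B1, U+03B2, U+03B3
here over -                          \ ( c-addr u ): start and length in chars
2constant greek$                     \ hypothetical name for the string

greek$ nip .                         \ 6: each of the three letters takes 2 chars
greek$ type                          \ αβγ on a UTF-8 terminal
```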