String and character literals (Gforth Manual)


6.9.3 String and Character literals

The nicest way to write a string literal is to write it as "STRING". For these kinds of string literals as well as for s\" some sequences are not put in the resulting string as is, but are replaced as shown below. The sequences are mostly the same as in C (exceptions noted):

\a: 7 #bell (alert)
\b: 8 #bs (backspace)
\e: 27 #esc (escape, not in C99)
\f: 12 #ff (form feed)
\l: 10 #lf (line feed, not in C)
\m: 13 10 CR LF (not in C)
\n: sequence produced by newline (in C this produces a LF)
\q: 34 " (double quote, not in C)
\r: 13 #cr (carriage return)
\t: 9 #tab (horizontal tab)
\uXXXX: Unicode code point XXXX (in hex); auto-merges surrogate pairs (not in Forth-2012 nor C)
\UXXXXXXXX: Unicode code point XXXXXXXX (in hex, not in Forth-2012 nor C)
\v: 11 VT (vertical tab)
\xXX: raw byte (not code point) XX (in hex)
\z: 0 NUL (not in C)
\\: \
\": " (the \" does not terminate the string; not in Forth-2012)
\XXX: raw byte; XXX is 1-3 octal digits (not in Forth-2012)

A \ before any other character is reserved.
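As a small illustration of the escapes listed above, the following lines can be typed at the Gforth prompt. This is only a sketch, not part of the word definitions; it assumes the number base is decimal and uses nothing beyond s\", type, nip and . to show the results.

```forth
s\" col1\tcol2\n" type          \ a tab between the columns, then a newline
s\" She said \qhi\q" type       \ \q inserts a double quote without ending the string
s\" a\tb\z" nip .               \ 4: the characters a, tab, b, and NUL
s\" \m" nip .                   \ 2: CR followed by LF
```

Using nip . to print only the length is a convenient way to check how many characters an escape actually produced.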

Note that \xXX produces raw bytes, while \uXXXX and \UXXXXXXXX produce code points for the current encoding. E.g., if you use UTF-8 encoding and want to encode ä (code point U+00E4), you can write the letter ä itself, or write \xc3\xa4 (the UTF-8 bytes for this code point), \u00e4, or \U000000e4.
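The equivalence described above can be checked with a short sketch. It assumes a session that uses UTF-8 encoding and a Gforth version that supports the \u escape; compare is the standard string-comparison word and returns 0 for identical strings.

```forth
s\" \xc3\xa4" s\" \u00e4" compare .   \ 0 when both strings hold the same bytes
s\" \u00e4" nip .                      \ 2 under UTF-8: one code point, two bytes
s\" \xc3\xa4" type                     \ displays the letter on a UTF-8 terminal
```

With a different encoding the \u form may yield other bytes, whereas the \x form always yields exactly the bytes written.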

The "STRING" syntax is non-standard, so for portability you may want to use one of the following words:

s\" ( Interpretation ’ccc"’ – c-addr u  ) core-ext,file-ext “s-backslash-quote”

Interpretation: Parse the string ccc delimited by a " (but not \"), and convert escaped characters as described above. Store the resulting string in newly allocated heap memory, and push its descriptor c-addr u.
Compilation ( 'ccc"' -- ): Parse the string ccc delimited by a " (but not \"), and convert escaped characters as described above. Append the run-time semantics below to the current definition.
Run-time ( -- c-addr u ): Push a descriptor for the resulting string.

S" ( Interpretation ’ccc"’ – c-addr u  ) core,file “s-quote”

Interpretation: Parse the string ccc delimited by a " (double quote). Store the resulting string in newly allocated heap memory, and push its descriptor c-addr u.
Compilation ( 'ccc"' -- ): Parse the string ccc delimited by a " (double quote). Append the run-time semantics below to the current definition.
Run-time ( -- c-addr u ): Push a descriptor for the parsed string.
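The difference between the interpretation and compilation semantics of these words can be seen in a minimal sketch. The word name greet is made up for this illustration, and decimal base is assumed.

```forth
\ Compilation: the strings become part of the definition of greet.
: greet ( -- ) s" Hello, " type s\" world\n" type ;
greet

\ Interpretation: the string is parsed and its descriptor is pushed right away.
s" typed at the prompt" type
s" five!" nip .    \ 5
```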

All these ways of interpreting strings consume heap memory; normally you can just live with the string consuming memory until the end of the Gforth session, but if that is a problem for some reason, you can free the string when you no longer need it. Forth-2012 only guarantees two buffers of 80 characters each, so in standard programs you should assume that the string lives only until the next-but-one s".
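A sketch of both approaches to string lifetime follows. The names name-buf, name-len, save-name and saved-name are hypothetical and exist only for this example. The last line assumes that the c-addr pushed by an interpreted s" in Gforth is the address returned by allocate, so that free accepts it; this follows from the description above but is not portable to other Forth systems.

```forth
\ Portable approach: copy the text into a buffer the program owns.
create name-buf 80 chars allot
variable name-len
: save-name ( c-addr u -- )  80 min dup name-len !  name-buf swap chars move ;
: saved-name ( -- c-addr u ) name-buf name-len @ ;

s" Gforth" save-name
s" first" 2drop  s" second" 2drop   \ later strings may reuse transient buffers
saved-name type                      \ still prints Gforth

\ Gforth only (assumption, see above): release the heap copy explicitly.
s" scratch" drop free throw
```

Copying costs a fixed buffer, but the program then no longer depends on how long the system keeps an interpreted string alive.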

On the other hand, when a string literal of any form is compiled, the string is allocated in the dictionary. You cannot free it, and it lives as long as the word it is compiled into (this also holds in Forth-2012).

Likewise, you can get the code xc of a character C with 'C'. This syntax has been standard since Forth-2012. An older way to get it is to use one of the following words:

char ( ’<spaces>ccc’ – c  ) core,xchar-ext

Skip leading spaces. Parse the string ccc and return c, the display code representing the first character of ccc.

[char] ( compilation ’<spaces>ccc’ – ; run-time – c  ) core,xchar-ext “bracket-char”

Compilation: skip leading spaces. Parse the string ccc. Run-time: return c, the display code representing the first character of ccc. Interpretation semantics for this word are undefined.

You usually use char outside and [char] inside colon definitions, or you just use 'C'.
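The three forms can be compared in a short sketch. The word names emit-a and star are made up for this example, and decimal base is assumed.

```forth
char A .                         \ 65, interpreted: char parses the next word now
: emit-a ( -- ) [char] A emit ;  \ compiled: [char] makes 65 a literal in emit-a
emit-a                           \ prints A
'A' .                            \ 65, the same value in either state
: star ( -- ) '*' emit ;
star                             \ prints *
```

Because 'C' works the same way inside and outside colon definitions, it avoids having to choose between char and [char].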

Note that, e.g.,

"C" type

is (slightly) more efficient than

'C' xemit

because the latter converts the code point into a sequence of bytes and individually emits them. Similarly, dealing with general characters is usually more efficient when representing them as strings rather than code points.

There are the following words for producing commonly-used characters and strings that cannot be produced with S" or 'C':

newline ( – c-addr u ) gforth-0.5 “newline”

String containing the newline sequence of the host OS.

bl ( – c-char  ) core “b-l”

c-char is the character value for a space.

#tab ( – c  ) gforth-0.2 “number-tab”

#lf ( – c  ) gforth-0.2 “number-l-f”

#cr ( – c  ) gforth-0.2 “number-c-r”

#ff ( – c  ) gforth-0.2 “number-f-f”

#bs ( – c  ) gforth-0.2 “number-b-s”

#del ( – c  ) gforth-0.2 “number-del”

#bell ( – c  ) gforth-0.2 “number-bell”

#esc ( – c  ) gforth-0.5 “number-esc”

#eof ( – c  ) gforth-0.7 “number-e-o-f”

Despite its name, this is actually EOT (ASCII code 4, also known as ^D).
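A brief sketch of how these words might be used follows. The word name row is hypothetical, and the printed values assume decimal base.

```forth
bl .                  \ 32
#tab . #lf . #cr .    \ 9 10 13
#esc . #eof .         \ 27 4
: row ( -- ) ." name" #tab emit ." value" newline type ;
row                   \ name, a tab, value, then the host newline sequence
```

Note that the # words push a single character code to be used with emit, while newline pushes a string to be used with type.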