6.26 Regular Expressions

Regular expressions are pattern matching algorithms for strings found in many contemporary languages. You can add regular expression functionality to Gforth with require regexp.fs.

The classical implementation for this pattern matching is a backtracking algorithm, which is also necessary if you want to have features like backreferencing. Gforth implements regular expressions by providing a language to define backtracking programs for pattern matching. Basic element is the control structure FORKJOIN, which is a forward call within a word, and therefore allows to code a lightweight try and fail control structure.

FORK ( compilation – orig ; run-time f –  ) gforth-0.7 “FORK”

AHEAD-like control structure: calls the code after JOIN.

JOIN ( orig –  ) gforth-0.7 “JOIN”

THEN-like control structure for FORK

You can program any sort of arbitrary checks yourself by computing a flag and ?LEAVE when the check fails. Your regular expression code is enclosed in (( and )).

(( ( addr u –  ) regexp-pattern “((”

start regexp block

)) ( – flag  ) regexp-pattern “))”

end regexp block

Pattern matching in regular expressions have character sets as elements, so a number of functions allow you to create and modify character sets (called charclass). All characters here are bytes, so this doesn’t extend to unicode characters.

charclass ( ) regexp-cg “charclass”

Create a charclass

+char ( char –  ) regexp-cg “+char”

add a char to the current charclass

-char ( char –  ) regexp-cg “-char”

remove a char from the current charclass

..char ( start end –  ) regexp-cg “..char”

add a range of chars to the current charclass

+chars ( addr u –  ) regexp-cg “+chars”

add a string of chars to the current charclass

+class ( class –  ) regexp-cg “+class”

union of charclass class and the current charclass

-class ( class –  ) regexp-cg “-class”

subtract the charclass class from the current charclass

There are predefined charclasses and tests for them, and generic checks. If a check fails, the next possible alternative of the regular expression is tried, or a loop is terminated.

c? ( addr class –  ) regexp-pattern “c?”

check addr for membership in charclass class

-c? ( addr class –  ) regexp-pattern “-c?”

check addr for not membership in charclass class

\d ( addr – addr’  ) regexp-pattern “\d”

check for digit

\s ( addr – addr’  ) regexp-pattern “\s”

check for blanks

.? ( addr – addr’  ) regexp-pattern “.?”

check for any single charachter

-\d ( addr – addr’  ) regexp-pattern “-\d”

check for not digit

-\s ( addr – addr’  ) regexp-pattern “-\s”

check for not blank

` ( "char" –  ) regexp-pattern “‘”

check for particular char

`? ( "char" –  ) regexp-pattern “‘?”
-` ( "char" –  ) regexp-pattern “-‘”

check for particular char

You can certainly also check for start and end of the string, and for whole string constants.

\^ ( addr – addr  ) regexp-pattern “\^”

check for string start

\$ ( addr – addr  ) regexp-pattern “\$”

check for string end

str=? ( addr1 addr u – addr2  ) regexp-pattern “str=?”

check for a computed string on the stack (possibly a backreference)

=" ( <string>" –  ) regexp-pattern “="”

check for string

Loops that check for repeated character sets can be greedy or non-greedy.

{** ( addr – addr addr  ) regexp-pattern “begin-greedy-star”

greedy zero-or-more pattern

**} ( sys –  ) regexp-pattern “end-greedy-star”

end of greedy zero-or-more pattern

{++ ( addr – addr addr  ) regexp-pattern “begin-greedy-plus”

greedy one-or-more pattern

++} ( sys –  ) regexp-pattern “end-greedy-plus”

end of greedy one-or-more pattern

{* ( addr – addr addr  ) regexp-pattern “begin-non-greedy-star”

non-greedy zero-or-more pattern

*} ( addr addr’ – addr’  ) regexp-pattern “end-non-greedy-star”

end of non-greedy zero-or-more pattern

{+ ( addr – addr addr  ) regexp-pattern “begin-non-greedy-plus”

non-greedy one-or-more pattern

+} ( addr addr’ – addr’  ) regexp-pattern “end-non-greedy-plus”

end of non-greedy one-or-more pattern

Example: Searching for a substring really is a non-greedy match of anything in front of it.

// ( ) regexp-pattern “//”

search for string

Alternatives are written with

{{ ( addr – addr addr  ) regexp-pattern “begin-alternatives”

Start of alternatives

|| ( addr addr – addr addr  ) regexp-pattern “next-alternative”

separator between alternatives

}} ( addr addr – addr  ) regexp-pattern “end-alternatives”

end of alternatives

You can use up to 9 variables named \1 to \9 to refer to matched substrings

\( ( addr – addr  ) regexp-pattern “\(”

start of matching variable; variables are referred as \\1–9

\) ( addr – addr  ) regexp-pattern “\)”

end of matching variable

\0 ( – addr u  ) regexp-pattern “\0”

the whole string

Certainly, you can also write code to replace patterns you found.

s>> ( addr – addr  ) regexp-replace “s>>”

Start replace pattern region

>> ( addr – addr  ) regexp-replace “>>”

Start arbitrary replacement code, the code shall compute a string on the stack and pass it to <<

<< ( run-addr addr u – run-addr  ) regexp-replace “<<”

Replace string from start of replace pattern region with addr u

<<" ( "string<">" –  ) regexp-replace “<<"”

Replace string from start of replace pattern region with string

s// ( addr u – ptr  ) regexp-replace “s//”

start search/replace loop

//s ( ptr –  ) regexp-replace “//s”

search end

//o ( ptr addr u – addr’ u’  ) regexp-replace “//o”

end search/replace single loop

//g ( ptr addr u – addr’ u’  ) regexp-replace “//g”

end search/replace all loop

Examples can be found in test/regexp-test.fs.