Regular expressions are pattern matching algorithms for strings found in
many contemporary languages. You can add regular expression
functionality to Gforth with require regexp.fs
.
The classical implementation for this pattern matching is a backtracking
algorithm, which is also necessary if you want to have features like
backreferencing. Gforth implements regular expressions by providing a
language to define backtracking programs for pattern matching. Basic
element is the control structure FORK
… JOIN
, which
is a forward call within a word, and therefore allows to code a
lightweight try and fail control structure.
FORK
( compilation – orig ; run-time f – ) gforth-0.7 “FORK”
AHEAD-like control structure: calls the code after JOIN.
JOIN
( orig – ) gforth-0.7 “JOIN”
THEN-like control structure for FORK
You can program any sort of arbitrary checks yourself by computing a
flag and ?LEAVE
when the check fails. Your regular expression
code is enclosed in ((
and ))
.
((
( addr u – ) regexp-pattern “((”
start regexp block
))
( – flag ) regexp-pattern “))”
end regexp block
Pattern matching in regular expressions have character sets as elements,
so a number of functions allow you to create and modify character sets
(called charclass
). All characters here are bytes, so this
doesn’t extend to unicode characters.
charclass
( – ) regexp-cg “charclass”
Create a charclass
+char
( char – ) regexp-cg “+char”
add a char to the current charclass
-char
( char – ) regexp-cg “-char”
remove a char from the current charclass
..char
( start end – ) regexp-cg “..char”
add a range of chars to the current charclass
+chars
( addr u – ) regexp-cg “+chars”
add a string of chars to the current charclass
+class
( class – ) regexp-cg “+class”
union of charclass class and the current charclass
-class
( class – ) regexp-cg “-class”
subtract the charclass class from the current charclass
There are predefined charclasses and tests for them, and generic checks. If a check fails, the next possible alternative of the regular expression is tried, or a loop is terminated.
c?
( addr class – ) regexp-pattern “c?”
check addr for membership in charclass class
-c?
( addr class – ) regexp-pattern “-c?”
check addr for not membership in charclass class
\d
( addr – addr’ ) regexp-pattern “\d”
check for digit
\s
( addr – addr’ ) regexp-pattern “\s”
check for blanks
.?
( addr – addr’ ) regexp-pattern “.?”
check for any single charachter
-\d
( addr – addr’ ) regexp-pattern “-\d”
check for not digit
-\s
( addr – addr’ ) regexp-pattern “-\s”
check for not blank
`
( "char" – ) regexp-pattern “‘”
check for particular char
`?
( "char" – ) regexp-pattern “‘?”
-`
( "char" – ) regexp-pattern “-‘”
check for particular char
You can certainly also check for start and end of the string, and for whole string constants.
\^
( addr – addr ) regexp-pattern “\^”
check for string start
\$
( addr – addr ) regexp-pattern “\$”
check for string end
str=?
( addr1 addr u – addr2 ) regexp-pattern “str=?”
check for a computed string on the stack (possibly a backreference)
="
( <string>" – ) regexp-pattern “="”
check for string
Loops that check for repeated character sets can be greedy or non-greedy.
{**
( addr – addr addr ) regexp-pattern “begin-greedy-star”
greedy zero-or-more pattern
**}
( sys – ) regexp-pattern “end-greedy-star”
end of greedy zero-or-more pattern
{++
( addr – addr addr ) regexp-pattern “begin-greedy-plus”
greedy one-or-more pattern
++}
( sys – ) regexp-pattern “end-greedy-plus”
end of greedy one-or-more pattern
{*
( addr – addr addr ) regexp-pattern “begin-non-greedy-star”
non-greedy zero-or-more pattern
*}
( addr addr’ – addr’ ) regexp-pattern “end-non-greedy-star”
end of non-greedy zero-or-more pattern
{+
( addr – addr addr ) regexp-pattern “begin-non-greedy-plus”
non-greedy one-or-more pattern
+}
( addr addr’ – addr’ ) regexp-pattern “end-non-greedy-plus”
end of non-greedy one-or-more pattern
Example: Searching for a substring really is a non-greedy match of anything in front of it.
//
( – ) regexp-pattern “//”
search for string
Alternatives are written with
{{
( addr – addr addr ) regexp-pattern “begin-alternatives”
Start of alternatives
||
( addr addr – addr addr ) regexp-pattern “next-alternative”
separator between alternatives
}}
( addr addr – addr ) regexp-pattern “end-alternatives”
end of alternatives
You can use up to 9 variables named \1
to \9
to refer to
matched substrings
\(
( addr – addr ) regexp-pattern “\(”
start of matching variable; variables are referred as \\1–9
\)
( addr – addr ) regexp-pattern “\)”
end of matching variable
\0
( – addr u ) regexp-pattern “\0”
the whole string
Certainly, you can also write code to replace patterns you found.
s>>
( addr – addr ) regexp-replace “s>>”
Start replace pattern region
>>
( addr – addr ) regexp-replace “>>”
Start arbitrary replacement code, the code shall compute a string
on the stack and pass it to <<
<<
( run-addr addr u – run-addr ) regexp-replace “<<”
Replace string from start of replace pattern region with addr u
<<"
( "string<">" – ) regexp-replace “<<"”
Replace string from start of replace pattern region with string
s//
( addr u – ptr ) regexp-replace “s//”
start search/replace loop
//s
( ptr – ) regexp-replace “//s”
search end
//o
( ptr addr u – addr’ u’ ) regexp-replace “//o”
end search/replace single loop
//g
( ptr addr u – addr’ u’ ) regexp-replace “//g”
end search/replace all loop
Examples can be found in test/regexp-test.fs
.