This is a regular expression compiler implemented in C11 based on Ken Thompson's paper on string search techniques. The project is heavily inspired by Russ Cox's article - although the implementation is a naive attempt to implement it in my own way to better understand the algorithm. One key difference is that instead of using the conversion of the regular expression to postfix and then using a pushdown stack to compile to an NFA - I opted to parse the expression into an AST and then traverse the AST to compile to an NFA.
Right now the compiler only supports the ?
, *
, +
, |
, .
operators and expression concatenation.
It also supports range based character such as [a-z0-9]
.
Anchored matches and reptition operators (such as {m}
, {m,n}
) should be easy to add but not yet done.
$ make clean && make
$./lexer_tests
$./parser_tests
$./nfa_executor_tests
Based on the benchmarking expression used in Russ Cox's article. The program generates the
expression
. For example for n=3
the expression would be a?a?a?aaa
and it would be tested against the string aaa
.
The program generates expressions for n=1 to n=100 and times their execution.
The output is in CSV format with the first value being value of n
and
second value being time taken to compile and execute the regular expression against the input string.
$./benchmark
Following is a comparison of performance of this implementation vs the Java regular expression library
Another benchmark is provided for expression (a|aa)*b
for string of the form a...a!
. With each increasing a
in the
string, the Java regex engine gets slower exponentially.