cl-irregsexp

Introduction

cl-irregsexp is an ASDF-installable library for fast text matching. It goes beyond the facilities allowed by traditional regular expressions, while also making the matchers easier to write and maintain.

Why?

Traditional regexps are everywhere and well-understood. There are already several fine regular expression toolkits for Common Lisp. They are both complete and mature. Why use cl-irregsexp?

The syntax: instead of writing (with cl-ppcre)

(register-groups-bind (method url version-major version-minor) 
	  ("(\\S+)\\s+(\\S+)\\s+HTTP/(\\d+).(\\d+)$" line)
	  ...)

write

(match-bind (method (+ (space)) url (+ (space)) "HTTP/" (version-major (integer)) "." (version-minor (integer)) (last))  
          line 
	  ...)

which might be clearer and doesn't involve mentioning the captured variables twice.

cl-irregsexp works seamlessly on both arrays of integers and on strings.
cl-irregsexp has excellent performance (for some operations) using open-coded Boyer-Moore matchers.

Why not?

No documentation.
The syntax is weird. For fast, complete and mature Perl compatible regexps check out cl-ppcre for example, or terse-ppcre.
Things are still in early stages. Anything might change, even the name.
Lack of testing and features.

Help!

Suggest a better name!
Improve the syntax.
Write documentation.
Add more features.

Write to the CL-IRREGSEXP-devel mailing list. Or add to the wiki.

Download

Tarball of the the latest git commit.

Here are some snapshots.

To check out the tree locally:

git clone http://common-lisp.net/projects/cl-irregsexp/cl-irregsexp.git

Implementation

The string matcher is generated with macros, and is completely inlined. It does not build a state machine (DFA) lookup table but uses case statements to implement the state machine in native code.

First the matcher description is translated to an intermediate form with the following primitives:

Fixed length constant string matcher. For each position in the string any number of characters may be accepted. For example the expression in traditional regexps "depot[A-Z](1|2)" is considered as a constant string.
Finite list of choices. For example "(or (integer) "undefined")".
Sequence of other matchers in order.
An object that only matches the end of the string.
Any other lisp form.

A few transforms are applied to the intermediate form, then it is output as Lisp, which will hopefully be compiled to efficient native machine code by the Lisp environment.

One particular case that has been optimised a little is that of searching for a constant string. The algorithm used is Boyer-Moore, but implementation has some unusual facets.

Native code generated.
Does not fall back to a slow matcher if the last character of the needle string is matched in the haystack, but continues to apply the algorithm to the rest of the needle.
Can be applied not only to simple string needles but any fixed length constant matcher as described above, so, for example, it can quickly search for a case insensitive needle.

Benchmark against other regex implementations

It is very irritating when people two different spellings for the same word. cl-irregsexp can sometimes search for both nearly as fast as it can search for just one.

To demonstrate its efficiency, I chose to compare how long it takes to search for "indecipherable" or "undecipherable" in a long string, among many regex implementations.

The haystack was one million random lowercase letters followed by "undecipherable". Each implementation reads the haystack into memory and then prints out how many milliseconds it took to search it for the two possibilities. Just to be safe, it also has to print the offset where the needle was found.

To make the test results less susceptible to slow starts from caches and branch prediction units, most implementations repeat the search 1000 times and print out the average number of milliseconds. The implementations all abort if the string is ever found at an incorrect position.

Implementation	Time (s), smaller is better
cl-ppcre.clozure	.214.70
-	.214.56
-	.214.19
-	.213.39
-	.215.26
cl-ppcre.sbcl	.101.44
-	.101.44
-	.101.42
-	.101.41
-	.101.42
ruby.rb	.41.18
-	.41.82
-	.42.08
-	.41.41
-	.42.30
python.py	.36.49
-	.36.56
-	.36.81
-	.36.63
-	.36.50
pcre	.32.12
-	.32.11
-	.32.11
-	.31.98
-	.33.01
perl.pl	.24.55
-	.24.56
-	.24.52
-	.24.60
-	.24.56
cl-irregsexp.clozure	.12.02
-	.12.04
-	.12.02
-	.12.11
-	.12.10
re2	.10.09
-	.10.08
-	.10.09
-	.10.08
-	.10.10
cl-irregsexp.sbcl	.6.24
-	.6.24
-	.6.25
-	.6.24
-	.6.23

Notes on the benchmark implementations

The source code for all is available in the cl-irregsexp download in the bench/ directory.

Ruby has the smallest and neatest benchmark program. It gets the award for concision.

Perl has the implementation involving the most one character global variables. It wins the second prize for speed and the first prize for being impenetrable to language dilettanti.

I also tried with clisp (a Common Lisp implementation that compiles to byte code). It took a couple of minutes with cl-irregsexp but a massive 2622.48s with cl-ppcre, so I did not include it in the results.

If you fancy sending in an example for another regexp implementation, I'd be pleased to add it!

Generating the test data

Here are the commands to make the data in the bash shell.

$ cat /dev/urandom | tr -d -c abcdefghijklmnopqrstuvwxyz | dd bs=1 count=1000000 > test-data
$ echo undecipherable >> test-data

Even faster

The code generated calculates a skip distance by doing a (case (peek-one-character) ...). SBCL unfortunately translates this into a sequence of conditional jumps apparently independently of the number of different cases. At some point it is surely better to use a jump table.

Project members

Valid XHTML 1.0 Strict