Status report: regex replacement - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Status report: regex replacement |
Date | |
Msg-id | 28172.1044468601@sss.pgh.pa.us |
List | pgsql-hackers |
I have just committed the latest version of Henry Spencer's regex package (lifted from Tcl 8.4.1) into CVS HEAD. This code is natively able to handle wide characters efficiently, and so it avoids the multibyte performance problems recently exhibited by Wade Klaver. I have not done extensive performance testing, but the new code seems at least as fast as the old, and much faster in some cases.

Also, we now have a regex flavor that is an exact match for recent Tcl releases and a close match for recent Perl releases; it sports back references and lookahead among other niceties.

There's some stuff still to do:

1. There are a couple of minor incompatibilities between the "advanced" regex syntax implemented by this package and the syntax handled by our old code; in particular, backslash is now a special character within bracket expressions. It seems to me that we'd better offer a switch to allow backwards compatibility. This is easily done as far as the code is concerned: the regex library actually offers three regex flavors, "advanced", "extended", and "basic", where "extended" matches what we had before ("extended" and "basic" correspond to different levels of the POSIX 1003.2 standard). We just need a way to expose that knob to the user. I am thinking about inventing yet another GUC parameter, say

	set regex_flavor = advanced
	set regex_flavor = extended
	set regex_flavor = basic

We could satisfy the immediate need with just a boolean "advanced_regex = on/off", but it seems forward-looking to allow for the possibility of more flavors in future. (For one thing, this would offer an easy place to select a different regex package, in case anyone still wants to play around with sre or the other alternatives that were mentioned yesterday.) Any suggestions about the name of the parameter?

2. Documentation. I've transformed Spencer's manual page into SGML and added it to func.sgml, but it's starting to look a tad, um, bulky:

	http://developer.postgresql.org/docs/postgres/functions-matching.html#FUNCTIONS-POSIX-REGEXP

The regex section now accounts for 1200+ out of func.sgml's 7500 lines. Should it be split out as an appendix, or is it okay where it is?

3. I've been toying with the idea of getting rid of the special-purpose matching code for LIKE (see adt/like.c and like_match.c), and reimplementing LIKE as a front-end to the regex engine; all it would need is to translate LIKE patterns into regex style, much as we already do for SQL99's SIMILAR TO patterns. This would reduce the maintenance needs for LIKE by a great deal. In some preliminary tests here, it seemed that the special-purpose LIKE code is faster than equivalent regex matching would be --- but I didn't try the multibyte code path, nor any but the simplest of patterns. Anyone want to try some more extensive benchmarking?

4. The new regex code is 8-bit-clean (no dependency on null-terminated strings), so it'd be feasible to implement regex matching for BYTEA. Over to you on that one, Joe.

regards, tom lane
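
Illustrative note on item 3: the LIKE-to-regex translation mentioned there can be sketched roughly as follows. This is only a minimal sketch in Python using its standard `re` module, not the PostgreSQL code and not Spencer's library; the function names `like_to_regex` and `like` are hypothetical, and the default backslash escape character and the handling of a trailing escape are assumptions made for the example.

```python
import re

def like_to_regex(pattern: str, escape: str = "\\") -> str:
    """Translate a SQL LIKE pattern into a regex string (hypothetical helper)."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == escape and i + 1 < len(pattern):
            # Escaped character: the next character is matched literally.
            out.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "%":
            out.append(".*")           # any run of characters, including none
        elif ch == "_":
            out.append(".")            # exactly one character
        else:
            # Everything else is a literal, so regex metacharacters are quoted.
            # Assumption: a trailing escape character is also treated literally.
            out.append(re.escape(ch))
        i += 1
    return "".join(out)

def like(value: str, pattern: str) -> bool:
    """LIKE must match the whole string, hence fullmatch; DOTALL lets '_' and '%' cover newlines."""
    return re.fullmatch(like_to_regex(pattern), value, flags=re.DOTALL) is not None
```

For example, `like("hello world", "hello%")` is true while `like("hello", "hell")` is false, because the translated pattern is anchored at both ends. A sketch like this says nothing about the performance question raised in item 3, which would still need benchmarking against the special-purpose code.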