Thread: Status report: regex replacement

Status report: regex replacement

From
Tom Lane
Date:
I have just committed the latest version of Henry Spencer's regex
package (lifted from Tcl 8.4.1) into CVS HEAD.  This code is natively
able to handle wide characters efficiently, and so it avoids the
multibyte performance problems recently exhibited by Wade Klaver.
I have not done extensive performance testing, but the new code seems
at least as fast as the old, and much faster in some cases.

Also, we now have a regex flavor that is an exact match for recent
Tcl releases and a close match for recent Perl releases; it sports
back references and lookahead among other niceties.  

There's some stuff still to do:

1. There are a couple of minor incompatibilities between the "advanced"
regex syntax implemented by this package and the syntax handled by our
old code; in particular, backslash is now a special character within
bracket expressions.  It seems to me that we'd better offer a switch
to allow backwards compatibility.  This is easily done as far as the
code is concerned: the regex library actually offers three regex
flavors, "advanced", "extended", and "basic", where "extended" matches
what we had before ("extended" and "basic" correspond to different
levels of the POSIX 1003.2 standard).  We just need a way to expose
that knob to the user.  I am thinking about inventing yet another GUC
parameter, say
    set regex_flavor = advanced
    set regex_flavor = extended
    set regex_flavor = basic

We could satisfy the immediate need with just a boolean "advanced_regex
= on/off", but it seems forward-looking to allow for the possibility of
more flavors in future.  (For one thing, this would offer an easy place
to select a different regex package, in case anyone still wants to play
around with sre or the other alternatives that were mentioned
yesterday.)

Any suggestions about the name of the parameter?

2. Documentation.  I've transformed Spencer's manual page into SGML and
added it to func.sgml, but it's starting to look a tad, um, bulky:
http://developer.postgresql.org/docs/postgres/functions-matching.html#FUNCTIONS-POSIX-REGEXP
The regex section now accounts for 1200+ out of func.sgml's 7500 lines.
Should it be split out as an appendix, or is it okay where it is?

3. I've been toying with the idea of getting rid of the special-purpose
matching code for LIKE (see adt/like.c and like_match.c), and
reimplementing LIKE as a front-end to the regex engine; all it would
need is to translate LIKE patterns into regex style, much as we already
do for SQL99's SIMILAR TO patterns.  This would reduce the maintenance
needs for LIKE by a great deal.  In some preliminary tests here, it
seemed that the special-purpose LIKE code is faster than equivalent
regex matching would be --- but I didn't try the multibyte code path,
nor any but the simplest of patterns.  Anyone want to try some more
extensive benchmarking?

4. The new regex code is 8-bit-clean (no dependency on null-terminated
strings), so it'd be feasible to implement regex matching for BYTEA.
Over to you on that one, Joe.
        regards, tom lane
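
To make item 3 concrete, here is a minimal sketch of the kind of LIKE-to-regex translation it describes. It is written in Python rather than the backend's C, and it is not the committed code: the names like_to_regex and like, the default backslash escape, and the choice to treat a trailing escape character as a literal are all assumptions made for illustration.

```python
import re


def like_to_regex(pattern: str, escape: str = "\\") -> str:
    """Translate an SQL LIKE pattern into a regex body (hypothetical helper)."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == escape and i + 1 < len(pattern):
            # Escaped character: the next character is matched literally.
            i += 1
            out.append(re.escape(pattern[i]))
        elif ch == "%":
            out.append(".*")  # % matches any sequence of characters
        elif ch == "_":
            out.append(".")   # _ matches exactly one character
        else:
            # Everything else is literal, so regex metacharacters are escaped.
            out.append(re.escape(ch))
        i += 1
    return "".join(out)


def like(value: str, pattern: str) -> bool:
    """LIKE must match the whole string, hence fullmatch instead of search."""
    return re.fullmatch(like_to_regex(pattern), value, re.DOTALL) is not None


if __name__ == "__main__":
    print(like("abc", "a%"))         # prefix match
    print(like("100%", "100\\%"))    # escaped percent sign is literal
    print(like("abc", "a.c"))        # "." is not a wildcard in LIKE
```

A translation of this shape says nothing about speed; the benchmarking question raised in item 3 still has to be answered against the real regex engine and the multibyte code path.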


Re: Status report: regex replacement

From
Jon Jensen
Date:
On Wed, 5 Feb 2003, Tom Lane wrote:

> 1. There are a couple of minor incompatibilities between the "advanced"
> regex syntax implemented by this package and the syntax handled by our
> old code; in particular, backslash is now a special character within
> bracket expressions.  It seems to me that we'd better offer a switch
> to allow backwards compatibility.  This is easily done as far as the
> code is concerned: the regex library actually offers three regex
> flavors, "advanced", "extended", and "basic", where "extended" matches
> what we had before ("extended" and "basic" correspond to different
> levels of the POSIX 1003.2 standard).  We just need a way to expose
> that knob to the user.  I am thinking about inventing yet another GUC
> parameter, say
> 
>     set regex_flavor = advanced
>     set regex_flavor = extended
>     set regex_flavor = basic
[snip]
> Any suggestions about the name of the parameter?

Actually I think 'regex_flavor' sounds fine.

Jon


Re: Status report: regex replacement

From
"Christopher Kings-Lynne"
Date:
> >     set regex_flavor = advanced
> >     set regex_flavor = extended
> >     set regex_flavor = basic
> [snip]
> > Any suggestions about the name of the parameter?
> 
> Actually I think 'regex_flavor' sounds fine.

Not more Americanisms in our config files!! :P

Chris



Re: Status report: regex replacement

From
Tom Lane
Date:
"Christopher Kings-Lynne" <chriskl@familyhealth.com.au> writes:
>> Actually I think 'regex_flavor' sounds fine.

> Not more Americanisms in our config files!! :P

You want regex_flavour? ;-)
        regards, tom lane


Re: Status report: regex replacement

From
"Christopher Kings-Lynne"
Date:
> "Christopher Kings-Lynne" <chriskl@familyhealth.com.au> writes:
> >> Actually I think 'regex_flavor' sounds fine.
>
> > Not more Americanisms in our config files!! :P
>
> You want regex_flavour? ;-)

Hehe - yeah I don't really care.  I have to use 'color' often enough
accessing 100% of the world's programming APIs...

How about regex_type, regex_mode, regex_option, etc.? ;)

Chris



Re: Status report: regex replacement

From
Tom Lane
Date:
"Christopher Kings-Lynne" <chriskl@familyhealth.com.au> writes:
>> You want regex_flavour? ;-)

> Hehe - yeah I don't really care.  I have to use 'color' often enough
> accessing 100% of the world's programming APIs...

> How about regex_type, regex_mode, regex_option, etc.? ;)

Well, I used "flavor" in my previous message because Friedl uses that
term consistently in his book to refer to the various idiosyncrasies
of different regex implementations.  It seems like a good term to me;
the differences among them don't rise to the level of being different
languages, yet they're distinctly different.

type, mode, option all seem, um, colourless ...
        regards, tom lane


Re: Status report: regex replacement

From
Hannu Krosing
Date:
Christopher Kings-Lynne wrote on Thu, 06.02.2003 at 03:56:
> > >     set regex_flavor = advanced
> > >     set regex_flavor = extended
> > >     set regex_flavor = basic
> > [snip]
> > > Any suggestions about the name of the parameter?
> >
> > Actually I think 'regex_flavor' sounds fine.
>
> Not more Americanisms in our config files!! :P

Maybe support both, like for ANALYZE/ANALYSE ?

While at it, could we make another variant - ANALÜÜSI - which
would be native for me ;)

--
Hannu Krosing <hannu@tm.ee>


Re: Status report: regex replacement

From
Tatsuo Ishii
Date:
> I have just committed the latest version of Henry Spencer's regex
> package (lifted from Tcl 8.4.1) into CVS HEAD.  This code is natively
> able to handle wide characters efficiently, and so it avoids the
> multibyte performance problems recently exhibited by Wade Klaver.
> I have not done extensive performance testing, but the new code seems
> at least as fast as the old, and much faster in some cases.

I have tested the new regex with src/test/mb and it all passed. So the
new code looks safe at least for EUC_CN, EUC_JP, EUC_KR, EUC_TW,
MULE_INTERNAL, UNICODE, though the test does not include all possible
regex patterns.
--
Tatsuo Ishii


Re: Status report: regex replacement

From
Hannu Krosing
Date:
On Thu, 2003-02-06 at 13:25, Tatsuo Ishii wrote:
> > I have just committed the latest version of Henry Spencer's regex
> > package (lifted from Tcl 8.4.1) into CVS HEAD.  This code is natively
> > able to handle wide characters efficiently, and so it avoids the
> > multibyte performance problems recently exhibited by Wade Klaver.
> > I have not done extensive performance testing, but the new code seems
> > at least as fast as the old, and much faster in some cases.
> 
> I have tested the new regex with src/test/mb and it all passed. So the
> new code looks safe at least for EUC_CN, EUC_JP, EUC_KR, EUC_TW,
> MULE_INTERNAL, UNICODE, though the test does not include all possible
> regex patterns.

Perhaps we should not call the encoding UNICODE but UTF8 (which it
really is). UNICODE is a character set which has half a dozen official
encodings and calling one of them "UNICODE" does not make things very
clear.

-- 
Hannu Krosing <hannu@tm.ee>
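
The distinction drawn above, one character set with several encodings, can be seen with a small Python sketch. This is not PostgreSQL code; the codec names are Python's own, not PostgreSQL encoding names, and the helper encodings_of is hypothetical.

```python
def encodings_of(ch: str) -> dict:
    """Return the hex byte sequence of one character under several encodings."""
    codecs = ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "latin-1")
    return {name: ch.encode(name).hex() for name in codecs}


if __name__ == "__main__":
    # U+00FC LATIN SMALL LETTER U WITH DIAERESIS: one code point, many byte forms.
    for name, hexbytes in encodings_of("\u00fc").items():
        print(f"{name:10} {hexbytes}")
```

The same code point comes out as a different byte sequence under each encoding, which is why a single name "UNICODE" is ambiguous as an encoding name.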


Re: Status report: regex replacement

From
Tatsuo Ishii
Date:
> Perhaps we should not call the encoding UNICODE but UTF8 (which it
> really is). UNICODE is a character set which has half a dozen official
> encodings and calling one of them "UNICODE" does not make things very
> clear.

Right. Also, perhaps we should name LATIN1 (ISO-8859-1) more precisely,
since ISO-8859-1 can be encoded in either 7 bits or 8 bits (we use the
latter). I don't know what that would be called, though.
--
Tatsuo Ishii


Re: Status report: regex replacement

From
Hannu Krosing
Date:
Tatsuo Ishii wrote on Thu, 06.02.2003 at 17:05:
> > Perhaps we should not call the encoding UNICODE but UTF8 (which it
> > really is). UNICODE is a character set which has half a dozen official
> > encodings and calling one of them "UNICODE" does not make things very
> > clear.
> 
> Right. Also we perhaps should call LATIN1 or ISO-8859-1 more precisely
> way since ISO-8859-1 can be encoded in either 7 bit or 8 bit(we use
> this). I don't know what it is called though.

I don't think that calling 8-bit ISO-8859-1 ISO-8859-1 can confuse
anybody, but UCS-2 (ISO-10646-1), UTF-8 and UTF-16 are all widely used. 

UTF-8 seems to be the most popular, but even the XML standard requires all
compliant implementations to deal with at least both UTF-8 and UTF-16.

-- 
Hannu Krosing <hannu@tm.ee>


Re: Status report: regex replacement

From
Tatsuo Ishii
Date:
> > Right. Also we perhaps should call LATIN1 or ISO-8859-1 more precisely
> > way since ISO-8859-1 can be encoded in either 7 bit or 8 bit(we use
> > this). I don't know what it is called though.
> 
> I don't think that calling 8-bit ISO-8859-1 ISO-8859-1 can confuse
> anybody, but UCS-2 (ISO-10646-1), UTF-8 and UTF-16 are all widely used. 

I just pointed out that ISO-8859-1 is *not* an encoding, but a
character set.

> UTF-8 seems to be the most popular, but even XML standard requires all
> compliant implementations to deal with at least both UTF-8 and UTF-16.

I don't think PostgreSQL is going to natively support UTF-16.
--
Tatsuo Ishii


Re: Status report: regex replacement

From
Tim Allen
Date:
On Fri, 7 Feb 2003 00:49, Hannu Krosing wrote:
> Tatsuo Ishii wrote on Thu, 06.02.2003 at 17:05:
> > > Perhaps we should not call the encoding UNICODE but UTF8 (which it
> > > really is). UNICODE is a character set which has half a dozen official
> > > encodings and calling one of them "UNICODE" does not make things very
> > > clear.
> >
> > Right. Also we perhaps should call LATIN1 or ISO-8859-1 more precisely
> > way since ISO-8859-1 can be encoded in either 7 bit or 8 bit(we use
> > this). I don't know what it is called though.
>
> I don't think that calling 8-bit ISO-8859-1 ISO-8859-1 can confuse
> anybody, but UCS-2 (ISO-10646-1), UTF-8 and UTF-16 are all widely used.
>
> UTF-8 seems to be the most popular, but even XML standard requires all
> compliant implementations to deal with at least both UTF-8 and UTF-16.

Strong agreement from me, for whatever value you wish to place on my opinion.
UTF-8 is a preferable name to UNICODE. The case for distinguishing 7-bit from
8-bit latin1 seems much weaker.

Tim

--
-----------------------------------------------
Tim Allen          tim@proximity.com.au
Proximity Pty Ltd  http://www.proximity.com.au/ http://www4.tpg.com.au/users/rita_tim/



Re: Status report: regex replacement

From
Hannu Krosing
Date:
Tatsuo Ishii wrote on Fri, 07.02.2003 at 04:03:

> > UTF-8 seems to be the most popular, but even XML standard requires all
> > compliant implementations to deal with at least both UTF-8 and UTF-16.
> 
> I don't think PostgreSQL is going to natively support UTF-16.

By natively, do you mean "as backend storage" format or "as supported
client encoding" ?

-- 
Hannu Krosing <hannu@tm.ee>


Re: Status report: regex replacement

From
Peter Eisentraut
Date:
Tom Lane writes:

> code is concerned: the regex library actually offers three regex
> flavors, "advanced", "extended", and "basic", where "extended" matches
> what we had before ("extended" and "basic" correspond to different
> levels of the POSIX 1003.2 standard).  We just need a way to expose
> that knob to the user.  I am thinking about inventing yet another GUC
> parameter, say

Perhaps it should be exposed through different operators.  If someone uses
packages (especially functions) provided externally, they might have a
hard time coordinating what flavor is required by which part of what he is
using.

By analogy, imagine there was an environment variable that switched all
grep's to egrep's.  That would be a complete mess.

-- 
Peter Eisentraut   peter_e@gmx.net



Re: Status report: regex replacement

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> Tom Lane writes:
>> code is concerned: the regex library actually offers three regex
>> flavors, "advanced", "extended", and "basic", where "extended" matches
>> what we had before ("extended" and "basic" correspond to different
>> levels of the POSIX 1003.2 standard).  We just need a way to expose
>> that knob to the user.  I am thinking about inventing yet another GUC
>> parameter, say

> Perhaps it should be exposed through different operators.  If someone uses
> packages (especially functions) provided externally, they might have a
> hard time coordinating what flavor is required by which part of what he is
> using.

But one could argue the contrary, too: if you've got an
externally-provided package there may be no convenient way to get it to
use, say, ~!#@ in place of ~.  GUC variables can come in awfully handy
in scenarios like that.

Also, if one *can* alter the SQL context in which a regexp is used, there
is a solution already provided by Spencer's "regex metasyntax" hack --- see
http://developer.postgresql.org/docs/postgres/functions-matching.html#POSIX-METASYNTAX
That is, one could write something like
    foo ~ ('(?b)' || basic_regex_expression)

to force basic_regex_expression to be taken as a BRE and not the
extended syntax.  This is a tad uglier than changing the operator name,
perhaps, but it has some advantages too --- for one, the option could be
plugged into the string further upstream than where the SQL syntax is
determined.

Basically I think the flavor-as-GUC-variable approach is orthogonal to
inventing some new operator names.  We could do the latter too, but
I don't really see a need for it given the metasyntax feature.
        regards, tom lane


Re: Status report: regex replacement

From
Peter Eisentraut
Date:
Tatsuo Ishii writes:

> > UTF-8 seems to be the most popular, but even XML standard requires all
> > compliant implementations to deal with at least both UTF-8 and UTF-16.
>
> I don't think PostgreSQL is going to natively support UTF-16.

At FOSDEM it was claimed that Windows natively uses UCS-2, and there are
also continuing rumours that the Java Unicode encoding is not quite UTF-8,
so there is going to be a certain pressure to support other Unicode
encodings besides UTF-8.

As for the names, the SQL standard defines most of those.

-- 
Peter Eisentraut   peter_e@gmx.net