Thread: regular expressions from hell

regular expressions from hell

From
Brett McCormick
Date:
I've noticed there are no less than 10^10 regex implementations.
Is there a standard?  Does ANSI have a regexp standard, or is there
a regex standard in the ANSI SQL spec?  What do we use?

Personally, I'm a perl guy, so every time I have to bend my brain to
some other regex syntax, I get a headache.  As part of my perl PL
package, perl regexps will be included as a set of operators.

Is there interest in the release of perl-style regexp operators for
postgres before the PL is completed?  Note that this requires the
entire perl library to be loaded when the operator is used (possibly
expensive).  But, if you have a shared perl library, this only has to
happen once.

Re: [HACKERS] regular expressions from hell

From
"Thomas G. Lockhart"
Date:
> I've noticed there are no less than 10^10 regex implementations.
> Is there a standard?  Does ANSI have a regexp standard, or is there
> a regex standard in the ANSI SQL spec?  What do we use?

afaik the only regex in ANSI SQL is that implemented for the LIKE
operator. Pretty pathetic: uses "%" to match any string and "_" to match
any single character, and that's it. Ingres had a bit more, with bracketed
character ranges also. None as rich as what we already have in the backend
of Postgres.
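
To make the LIKE semantics described above concrete, here is a minimal sketch
in Python that translates a LIKE pattern into an ordinary regular expression.
The helper names (like_to_regex, like) are hypothetical and not part of
Postgres or any SQL library; the sketch only assumes the two wildcards
mentioned above plus an optional escape character.

```python
import re
from typing import Optional


def like_to_regex(pattern: str, escape: Optional[str] = None) -> "re.Pattern[str]":
    """Translate a SQL LIKE pattern into a compiled Python regex (illustrative only)."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        # An escape character makes the following character literal.
        if escape is not None and ch == escape and i + 1 < len(pattern):
            parts.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "%":
            parts.append(".*")   # "%" matches any string, including the empty one
        elif ch == "_":
            parts.append(".")    # "_" matches exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
        i += 1
    return re.compile("".join(parts), re.DOTALL)


def like(value: str, pattern: str, escape: Optional[str] = None) -> bool:
    """LIKE must match the whole value, so use fullmatch rather than search."""
    return like_to_regex(pattern, escape).fullmatch(value) is not None
```

With this sketch, like("postgres", "post%") and like("cat", "c_t") are true,
while like("cart", "c_t") is false because "_" stands for a single character.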

Don't know about any other ANSI standards for regex, but I don't know
that there isn't one either...

                       - Tom

Re: [HACKERS] regular expressions from hell

From
dg@illustra.com (David Gould)
Date:
> I've noticed there are no less than 10^10 regex implementations.
> Is there a standard?  Does ANSI have a regexp standard, or is there
> a regex standard in the ANSI SQL spec?  What do we use?

Good question. I think one of the standard unix regex's should be ok. At least
everyone knows how to work it, and they are quite small.

> Personally, I'm a perl guy, so every time I have to bend my brain to
> some other regex syntax, I get a headache.  As part of my perl PL
> package, perl regexps will be included as a set of operators.
>
> Is there interest in the release of perl-style regexp operators for
> postgres before the PL is completed?  Note that this requires the
> entire perl library to be loaded when the operator is used (possibly
> expensive).  But, if you have a shared perl library, this only has to
> happen once.

Hmmm, I really like the perl regex's, especially the extended syntax, but
I don't want to load a whole perl lib to get this.

-dg

David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
"Of course, someone who knows more about this will correct me if I'm wrong,
 and someone who knows less will correct me if I'm right."
               --David Palmer (palmer@tybalt.caltech.edu)

Re: [HACKERS] regular expressions from hell

From
Brett McCormick
Date:
Unfortunately, there's no other way.  This is mentioned in the
perlcall manpage, I believe.  One method which is ok in my book is to
load the shared perl lib once, in one backend, and then it can be
shared between all other backends when they need perl regex's.

There is no mechanism for auto-loading the type/func shared libraries
on postmaster startup, correct?  It happens per backend session?  So
to do the above you'd have to have one "Dummy" connection which just
did a simple regex and then while(1) { sleep(10^32) };

On Sun, 31 May 1998, at 16:46:30, David Gould wrote:

> Hmmm, I really like the perl regex's, especially the extended syntax, but
> I don't want to load a whole perl lib to get this.
>
> -dg
>
> David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
> Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
> "Of course, someone who knows more about this will correct me if I'm wrong,
>  and someone who knows less will correct me if I'm right."
>                --David Palmer (palmer@tybalt.caltech.edu)
>

Re: [HACKERS] regular expressions from hell

From
Brett McCormick
Date:
Not to mention the fact that if perl (or mod_perl) is already running
(and you're using a shared libperl), the library is already loaded.

On Sun, 31 May 1998, at 17:23:16, Brett McCormick wrote:

> Unfortunately, there's no other way.  This is mentioned in the
> perlcall manpage, I believe.  One method which is ok in my book is to
> load the shared perl lib once, in one backend, and then it can be
> shared between all other backends when they need perl regex's.
>
> There is no mechanism for auto-loading the type/func shared libraries
> on postmaster startup, correct?  It happens per backend session?  So
> to do the above you'd have to have one "Dummy" connection which just
> did a simple regex and then while(1) { sleep(10^32) };
>
> On Sun, 31 May 1998, at 16:46:30, David Gould wrote:
>
> > Hmmm, I really like the perl regex's, especially the extended syntax, but
> > I don't want to load a whole perl lib to get this.
> >
> > -dg
> >
> > David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
> > Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
> > "Of course, someone who knows more about this will correct me if I'm wrong,
> >  and someone who knows less will correct me if I'm right."
> >                --David Palmer (palmer@tybalt.caltech.edu)
> >
>

Re: [HACKERS] regular expressions from hell

From
Hal Snyder
Date:
> Date: Sun, 31 May 1998 18:56:29 -0700 (PDT)
> From: Brett McCormick <brett@work.chicken.org>
> Sender: owner-pgsql-hackers@hub.org

> Not to mention the fact that if perl (or mod_perl) is already running
> (and you're using a shared libperl), the library is already loaded.

If you're running Apache, mod_perl or not, isn't Posix regex loaded?
(HSREGEX or compatible?)

> On Sun, 31 May 1998, at 17:23:16, Brett McCormick wrote:
>
> > Unfortunately, there's no other way.  This is mentioned in the
> > perlcall manpage, I believe.  One method which is ok in my book is to
> > load the shared perl lib once, in one backend, and then it can be
> > shared between all other backends when they need perl regex's.
> >
> > There is no mechanism for auto-loading the type/func shared libraries
> > on postmaster startup, correct?  It happens per backend session?  So
> > to do the above you'd have to have one "Dummy" connection which just
> > did a simple regex and then while(1) { sleep(10^32) };
...


Re: [HACKERS] regular expressions from hell

From
dg@illustra.com (David Gould)
Date:
>
> Not to mention the fact that if perl (or mod_perl) is already running
> (and you're using a shared libperl), the library is already loaded.

Ok, my vote is to build regexes into the pgsql binary or into a .so that
we distribute. There should be no need to have perl installed on a system
to run postgresql. If we are going to extend the language to improve on
the very lame sql92 like clause, we need to have it be part of the system
that can be counted on, not something you might or might not have depending
on what else is installed.

-dg

David Gould           dg@illustra.com            510.628.3783 or 510.305.9468
Informix Software                      300 Lakeside Drive   Oakland, CA 94612
 - A child of five could understand this!  Fetch me a child of five.

Re: [HACKERS] regular expressions from hell

From
"Jose' Soares Da Silva"
Date:
On Sun, 31 May 1998, Thomas G. Lockhart wrote:

> > I've noticed there are no less than 10^10 regex implementations.
> > Is there a standard?  Does ANSI have a regexp standard, or is there
> > a regex standard in the ANSI SQL spec?  What do we use?
>
> afaik the only regex in ANSI SQL is that implemented for the LIKE
> operator. Pretty pathetic: uses "%" to match any string and "_" to match
> any single character, and that's it. Ingres had a bit more, with bracketed
> character ranges also. None as rich as what we already have in the backend
> of Postgres.
>
> Don't know about any other ANSI standards for regex, but I don't know
> that there isn't one either...
>
- SQL3 SIMILAR condition.
SIMILAR is intended for character string pattern matching. The difference
between SIMILAR and LIKE is that SIMILAR supports a much more extensive
range of possibilities ("wild cards," etc.) than LIKE does.
Here is the syntax:

          expression [ NOT ] SIMILAR TO pattern [ ESCAPE escape ]
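
As a rough illustration of how SIMILAR TO goes beyond LIKE, the following
Python sketch translates a small subset of SIMILAR TO patterns into a regular
expression. It assumes the pattern language keeps LIKE's "%" and "_" and adds
"|", "*", "+", parentheses and bracketed character ranges; the ESCAPE clause
and named character classes are deliberately left out, and the function names
are hypothetical.

```python
import re


def similar_to_regex(pattern: str) -> str:
    """Translate a subset of SIMILAR TO syntax into Python regex source (sketch)."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "%":
            out.append(".*")             # same meaning as in LIKE
        elif ch == "_":
            out.append(".")              # same meaning as in LIKE
        elif ch in "|*+()":
            out.append(ch)               # alternation, repetition, grouping
        elif ch == "[":
            # Copy a simple bracket expression such as [a-c] through unchanged.
            end = pattern.find("]", i + 1)
            if end == -1:
                raise ValueError("unterminated bracket expression")
            out.append(pattern[i:end + 1])
            i = end
        else:
            out.append(re.escape(ch))    # all other characters are literal
        i += 1
    return "".join(out)


def similar_to(value: str, pattern: str) -> bool:
    """Like LIKE, SIMILAR TO has to match the entire value."""
    return re.fullmatch(similar_to_regex(pattern), value, re.DOTALL) is not None
```

For example, similar_to("abc", "%(b|d)%") is true under this sketch, while
similar_to("abc", "(b|c)%") is false because the whole string must match.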

                                                      Jose'


Re: [HACKERS] regular expressions from hell

From
Bruce Momjian
Date:
>
> >
> > Not to mention the fact that if perl (or mod_perl) is already running
> > (and you're using a shared libperl), the library is already loaded.
>
> Ok, my vote is to build regexes into the pgsql binary or into a .so that
> we distribute. There should be no need to have perl installed on a system
> to run postgresql. If we are going to extend the language to improve on
> the very lame sql92 like clause, we need to have it be part of the system
> that can be counted on, not something you might or might not have depending
> on what else is installed.

We already have it as ~, just not with Perl extensions.  Our
implementation is very slow, and the author has said he is working on a
rewrite, though no time frame was given.

--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] regular expressions from hell

From
Brett McCormick
Date:
On Mon, 1 June 1998, at 10:16:35, Bruce Momjian wrote:

> > Ok, my vote is to build regexes into the pgsql binary or into a .so that
> > we distribute. There should be no need to have perl installed on a system
> > to run postgresql. If we are going to extend the language to improve on
> > the very lame sql92 like clause, we need to have it be part of the system
> > that can be counted on, not something you might or might not have depending
> > on what else is installed.

I'm not suggesting we require perl to be installed to run postgres, or
replace the current regexp implementation with perl.  I was just
lamenting the fact that there are no less than 10 different regexp
implementations, with different metacharacters.  Why should I have to
remember one syntax when I use perl, one for sed, one for emacs, and
another for postgresql?  This isn't a problem with postgres per se,
just the fact that there seems to be no standard.

I love perl regex's.  I'm merely suggesting (and planning on
implementing) a different set of regexp operators (not included with
postgres, but as a contrib module) that use perl regex's.  There are
some pros and cons, which have been discussed.

It should be there for people who want it.

>
> We already have it as ~, just not with Perl extensions.  Our
> implementation is very slow, and the author has said he is working on a
> rewrite, though no time frame was given.

Re: [HACKERS] regular expressions from hell

From
The Hermit Hacker
Date:
On Sun, 31 May 1998, David Gould wrote:

> >
> > Not to mention the fact that if perl (or mod_perl) is already running
> > (and you're using a shared libperl), the library is already loaded.
>
> Ok, my vote is to build regexes into the pgsql binary or into a .so that
> we distribute. There should be no need to have perl installed on a system
> to run postgresql. If we are going to extend the language to improve on
> the very lame sql92 like clause, we need to have it be part of the system
> that can be counted on, not something you might or might not have depending
> on what else is installed.

    Odd question here, but how many systems nowadays *don't* have Perl
installed that would be running PostgreSQL?  IMHO, perl is an invaluable
enough tool that I can't imagine a site not running it *shrug*



Re: [HACKERS] regular expressions from hell

From
ocie@paracel.com
Date:
Brett McCormick wrote:
>
> On Mon, 1 June 1998, at 10:16:35, Bruce Momjian wrote:
>
> > > Ok, my vote is to build regexes into the pgsql binary or into a .so that
> > > we distribute. There should be no need to have perl installed on a system
> > > to run postgresql. If we are going to extend the language to improve on
> > > the very lame sql92 like clause, we need to have it be part of the system
> > > that can be counted on, not something you might or might not have depending
> > > on what else is installed.
>
> I'm not suggesting we require perl to be installed to run postgres, or
> replace the current regexp implementation with perl.  I was just
> lamenting the fact that there are no less than 10 different regexp
> implementations, with different metacharacters.  Why should I have to
> remember one syntax when I use perl, one for sed, one for emacs, and
> another for postgresql?  This isn't a problem with postgres per se,
> just the fact that there seems to be no standard.

I think most of this is due to different decisions on what needs to be
escaped or not.  For instance, if memory serves, GNU grep treats
parens as metacharacters, which must be escaped with a backslash to
match parens, while in Emacs, parens match parens and must be escaped
to get their meta-character meaning.  Things have gone too far to have
one standard now I'm afraid.
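
The escaping difference described above can be sketched in a few lines of
Python. The hypothetical helper below flips the backslash convention for
grouping and interval characters only, so a pattern written in the style where
\( \) group (as in POSIX basic regexes) becomes one where bare ( ) group (as in
POSIX extended regexes and Perl), and vice versa. It is an illustration under
those assumptions, not a full converter: bracket expressions and the other
differences between dialects are ignored.

```python
import re

# Characters whose escaped/unescaped meaning is swapped between the two styles.
SWAPPED = set("(){}")


def swap_escapes(pattern: str) -> str:
    """Flip backslash-escaping on grouping and interval characters (sketch)."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt in SWAPPED:
                out.append(nxt)          # \( in one style is ( in the other
            else:
                out.append(ch + nxt)     # other escapes are left untouched
            i += 2
            continue
        if ch in SWAPPED:
            out.append("\\" + ch)        # bare ( in one style is \( in the other
        else:
            out.append(ch)
        i += 1
    return "".join(out)


# A basic-style pattern with a group, converted so Python's re can run it.
converted = swap_escapes(r"\(ab\)*c")
matches_group = re.fullmatch(converted, "ababc") is not None

# A basic-style pattern with literal parens, converted the same way.
literal = swap_escapes("f(x)")
matches_literal = re.fullmatch(literal, "f(x)") is not None
```

Applying swap_escapes twice returns the original pattern, which is one way to
see that the two conventions carry the same information and differ only in
which spelling is the metacharacter.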

Ocie

Re: [HACKERS] regular expressions from hell

From
Roland Roberts
Date:
-----BEGIN PGP SIGNED MESSAGE-----

>>>>> "ocie" == ocie  <ocie@paracel.com> writes:

    ocie> I think most of this is due to different decisions on what
    ocie> needs to be escaped or not.  For instance, if memory serves,
    ocie> GNU grep treats parens as metacharacters, which must be
    ocie> escaped with a backslash to match parens, while in Emacs,
    ocie> parens match parens and must be escaped to get their
    ocie> meta-character meaning.  Things have gone too far to have
    ocie> one standard now I'm afraid.

Please try to remember that there are historical reasons for some of
this.  grep and egrep behave differently with respect to parentheses;
again, this is historical.

Personally, I like Perl regexps.  And there is a library for Tcl/Tk
(nre) that implements the same syntax for that language.  But I do
like Emacs' syntax tables and character classes.  I can live with
switching back and forth to some extent....

roland

-----BEGIN PGP SIGNATURE-----
Version: 2.6.2
Comment: Processed by Mailcrypt 3.4, an Emacs/PGP interface

iQCVAwUBNXSyLuoW38lmvDvNAQHatQQAsyp+akdXl0TiptXsSlrp7tM2/Jb/jLnW
SfpkYVkk53iER/JMYMU4trfQQssePkqGmaF8GMeU5i8eMW6Vi3Vus2pqovnLa1eV
w5rCgxKXqpZnIhGJZeHIYieMfWxfdmWOUjawrjKv85vBRdZDYdRkLBoAWvI4ZaJb
JxAEwqbZrQw=
=Zgvo
-----END PGP SIGNATURE-----
--
Roland B. Roberts, PhD                  Custom Software Solutions
roberts@panix.com                           101 West 15th St #4NN
                                               New York, NY 10011

Re: [HACKERS] regular expressions from hell

From
dg@illustra.com (David Gould)
Date:
Roland B. Roberts, PhD writes:
> >>>>> "ocie" == ocie  <ocie@paracel.com> writes:
>
>     ocie> I think most of this is due to different decisions on what
>     ocie> needs to be escaped or not.  For instance, if memory serves,
>     ocie> GNU grep treats parens as metacharacters, which must be
>     ocie> escaped with a backslash to match parens, while in Emacs,
>     ocie> parens match parens and must be escaped to get their
>     ocie> meta-character meaning.  Things have gone too far to have
>     ocie> one standard now I'm afraid.
>
> Please try to remember that there are historical reasons for some of
> this.  grep and egrep behave differently with respect to parentheses;
> again, this is historical.
>
> Personally, I like Perl regexps.  And there is a library for Tcl/Tk
> (nre) that implements the same syntax for that language.  But I do
> like Emacs' syntax tables and character classes.  I can live with
> switching back and forth to some extent....

Emacs! Huh! I like VI regexes... Uh oh, sorry, wrong flamewar.

Isn't there a POSIX regex? Perhaps we could consider that, unless of course
it is well and truly broken.

Secondly, I seem to remember a post here in this same thread that said
we already had regexes. Perhaps we should move on.

Seriously, as part of a Perl extension to postgresql, perl regexes would
be the natural thing. But if we already have a regex package, I think
adding just perl regexes without perl, but requiring perl.so, is uhmmm,
premature.

-dg

David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
"Don't worry about people stealing your ideas.  If your ideas are any
 good, you'll have to ram them down people's throats." -- Howard Aiken

Re: [HACKERS] regular expressions from hell

From
dg@illustra.com (David Gould)
Date:
> I've noticed there are no less than 10^10 regex implementations.
> Is there a standard?  Does ANSI have a regexp standard, or is there
> a regex standard in the ANSI SQL spec?  What do we use?
>
> Personally, I'm a perl guy, so every time I have to bend my brain to
> some other regex syntax, I get a headache.  As part of my perl PL
> package, perl regexps will be included as a set of operators.
>
> Is there interest in the release of perl-style regexp operators for
> postgres before the PL is completed?  Note that this requires the
> entire perl library to be loaded when the operator is used (possibly
> expensive).  But, if you have a shared perl library, this only has to
> happen once.

Well, not to bring this up for discussion again, but there is apparently
a Posix standard, and even better a free implementation:


Article 10705 of comp.os.linux.misc:
Newsgroups: gnu.announce,gnu.utils.bug,comp.os.linux.misc,alt.sources.d
Subject: Rx 1.9
Date: Wed, 10 Jun 1998 10:40:00 -0700 (PDT)
Approved: info-gnu@gnu.org

The latest version of Rx, 1.9, is available on the web at:

    http://users.lanminds.com/~lord
    ftp://emf.net/users/lord/src/rx-1.9.tar.gz
 and at ftp://ftp.gnu.org/pub/gnu/rx-1.9.tar.gz and mirrors of that
                                                site (see list below).

Rx is a regexp pattern matching library. The library exports these
functions which are standardized by Posix:

     regcomp         - compile a regexp
     regexec         - search for a match
     regfree         - release storage for a regexp
     regerror        - translate error codes to strings

The library exports many other functions as well, and does a lot
more than Posix requires.

                RECENT CHANGES

1. Rx 1.9
   Recent changes: More "dead code" was recently discarded,
           and the remaining code simplified.

           Benchmark comparisons to GNU regex and older
           versions of Rx were added to the distribution.

0. Rx 1.8
   Recent changes: Various bug-fixes and small performance improvements.
           A great deal of "dead code" was recently discarded,
           making the size of the Rx library smaller and the
           source easier to maintain (in theory).


[ Most GNU software is compressed using the GNU `gzip' compression program.
  Source code is available on most sites distributing GNU software.
  Executables for various systems and information about using gzip can be
  found at the URL http://www.gzip.org.

  For information on how to order GNU software on CD-ROM and
  printed GNU manuals, see http://www.gnu.org/order/order.html
  or e-mail a request to: gnu@gnu.org

  By ordering your GNU software from the FSF, you help us continue to
  develop more free software.  Media revenues are our primary source of
  support.  Donations to FSF are deductible on US tax returns.

  The above software will soon be at these ftp sites as well.
  Please try them before ftp.gnu.org as ftp.gnu.org is very busy!
  A possibly more up-to-date list is at the URL
        http://www.gnu.org/order/ftp.html

  thanx -gnu@gnu.org

  Here are the mirrored ftp sites for the GNU Project, listed by country:



  United States:

  California - labrea.stanford.edu/pub/gnu, gatekeeper.dec.com/pub/GNU
  Hawaii - ftp.hawaii.edu/mirrors/gnu
  Illinois - uiarchive.cso.uiuc.edu/pub/gnu (Internet address 128.174.5.14)
  Kentucky -  ftp.ms.uky.edu/pub/gnu
  Maryland - ftp.digex.net/pub/gnu (Internet address 164.109.10.23)
  Michigan - gnu.egr.msu.edu/pub/gnu
  Missouri - wuarchive.wustl.edu/systems/gnu
  New York - ftp.cs.columbia.edu/archives/gnu/prep
  Ohio - ftp.cis.ohio-state.edu/mirror/gnu
  Utah - jaguar.utah.edu/gnustuff
  Virginia - ftp.uu.net/archive/systems/gnu

  Africa:

  South Africa - ftp.sun.ac.za/pub/gnu

  The Americas:

  Brazil - ftp.unicamp.br/pub/gnu
  Canada - ftp.cs.ubc.ca/mirror2/gnu
  Chile - ftp.inf.utfsm.cl/pub/gnu (Internet address 146.83.198.3)
  Costa Rica - sunsite.ulatina.ac.cr/GNU
  Mexico - ftp.uaem.mx/pub/gnu

  Asia and Australia:

  Australia - archie.au/gnu (archie.oz or archie.oz.au for ACSnet)
  Australia - ftp.progsoc.uts.edu.au/pub/gnu
  Japan - tron.um.u-tokyo.ac.jp/pub/GNU/prep
  Japan - ftp.cs.titech.ac.jp/pub/gnu
  Korea - cair-archive.kaist.ac.kr/pub/gnu (Internet address 143.248.186.3)
  Thailand - ftp.nectec.or.th/pub/mirrors/gnu (Internet address - 192.150.251.32)

  Europe:

  Austria - ftp.univie.ac.at/packages/gnu
  Czech Republic - ftp.fi.muni.cz/pub/gnu/
  Denmark - ftp.denet.dk/mirror/ftp.gnu.org/pub/gnu
  Finland - ftp.funet.fi/pub/gnu (Internet address 128.214.6.100)
  France - ftp.univ-lyon1.fr/pub/gnu
  France - ftp.irisa.fr/pub/gnu
  Germany - ftp.informatik.tu-muenchen.de/pub/comp/os/unix/gnu/
  Germany - ftp.informatik.rwth-aachen.de/pub/gnu
  Germany - ftp.de.uu.net/pub/gnu
  Greece - ftp.ntua.gr/pub/gnu
  Greece - ftp.aua.gr/pub/mirrors/GNU (Internet address 143.233.187.61)
  Ireland - ftp.ieunet.ie/pub/gnu (Internet address 192.111.39.1)
  Netherlands - ftp.eu.net/gnu (Internet address 192.16.202.1)
  Netherlands - ftp.nluug.nl/pub/gnu
  Netherlands - ftp.win.tue.nl/pub/gnu (Internet address 131.155.70.100)
  Norway - ugle.unit.no/pub/gnu (Internet address 129.241.1.97)
  Spain - ftp.etsimo.uniovi.es/pub/gnu
  Sweden - ftp.isy.liu.se/pub/gnu
  Sweden - ftp.stacken.kth.se
  Sweden - ftp.luth.se/pub/unix/gnu
  Sweden - ftp.sunet.se/pub/gnu (Internet address 130.238.127.3)
         Also mirrors the Mailing List Archives.
  Switzerland - ftp.eunet.ch/mirrors4/gnu
  Switzerland - sunsite.cnlab-switch.ch/mirror/gnu (Internet address 193.5.24.1)
  United Kingdom - ftp.mcc.ac.uk/pub/gnu (Internet address 130.88.203.12)
  United Kingdom - unix.hensa.ac.uk/mirrors/gnu
  United Kingdom - ftp.warwick.ac.uk (Internet address 137.205.192.14)
  United Kingdom - SunSITE.doc.ic.ac.uk/gnu (Internet address 193.63.255.4)

]

-dg

David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
"Don't worry about people stealing your ideas.  If your ideas are any
 good, you'll have to ram them down people's throats." -- Howard Aiken

Re: [HACKERS] regular expressions from hell

From
The Hermit Hacker
Date:
On Thu, 11 Jun 1998, David Gould wrote:

> Article 10705 of comp.os.linux.misc:
> Newsgroups: gnu.announce,gnu.utils.bug,comp.os.linux.misc,alt.sources.d
> Subject: Rx 1.9
> Date: Wed, 10 Jun 1998 10:40:00 -0700 (PDT)
> Approved: info-gnu@gnu.org
>
> The latest version of Rx, 1.9, is available on the web at:
>
>     http://users.lanminds.com/~lord
>     ftp://emf.net/users/lord/src/rx-1.9.tar.gz
>  and at ftp://ftp.gnu.org/pub/gnu/rx-1.9.tar.gz and mirrors of that
>                                                 site (see list below).

The reason that we do not use this particular Regex package is that *it*
falls under the "Almighty GPL", which conflicts with our Berkeley
Copyright...

Now, if there is a standardized spec on this, though, what would it take
to change our Regex to follow it, *without* the risk of tainting our code
with GPLd code?

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org