Thread: regular expressions from hell
I've noticed there are no less than 10^10 regex implementations. Is there a standard? Does ANSI have a regexp standard, or is there a regex standard in the ANSI SQL spec? What do we use?

Personally, I'm a perl guy, so every time I have to bend my brain to some other regex syntax, I get a headache. As part of my perl PL package, perl regexps will be included as a set of operators.

Is there interest in the release of perl-style regexp operators for postgres before the PL is completed? Note that this requires the entire perl library to be loaded when the operator is used (possibly expensive). But, if you have a shared perl library, this only has to happen once.
> I've noticed there are no less than 10^10 regex implementations.
> Is there a standard? Does ANSI have a regexp standard, or is there
> a regex standard in the ANSI SQL spec? What do we use?

afaik the only regex in ANSI SQL is that implemented for the LIKE operator. Pretty pathetic: uses "%" to match any string and "_" to match any single character, and that's it. Ingres had a bit more, with bracketed character ranges also. None as rich as what we already have in the backend of Postgres.

Don't know about any other ANSI standards for regex, but I don't know that there isn't one either...

- Tom
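To make the two LIKE wildcards concrete, here is a minimal sketch that translates a LIKE pattern into a Python regular expression. The function names like_to_regex and like are hypothetical, and the sketch assumes only the two wildcards described above: no ESCAPE clause, and no collation or case-folding rules.

```python
import re


def like_to_regex(pattern):
    """Translate a LIKE pattern into a compiled Python regex.

    Assumption: only "%" (any string) and "_" (any single character)
    are special; every other character matches itself literally.
    """
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    # DOTALL so that the wildcards also match newline characters.
    return re.compile("".join(parts), re.DOTALL)


def like(value, pattern):
    """Return True if the whole value matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(value) is not None
```

For example, like("postgres", "post%") holds, while like("abc", "a.c") does not, because "." is an ordinary character in a LIKE pattern.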
> I've noticed there are no less than 10^10 regex implementations.
> Is there a standard? Does ANSI have a regexp standard, or is there
> a regex standard in the ANSI SQL spec? What do we use?

Good question. I think one of the standard unix regex's should be ok. At least everyone knows how to work it, and they are quite small.

> Personally, I'm a perl guy, so every time I have to bend my brain to
> some other regex syntax, I get a headache. As part of my perl PL
> package, perl regexps will be included as a set of operators.
>
> Is there interest in the release of perl-style regexp operators for
> postgres before the PL is completed? Note that this requires the
> entire perl library to be loaded when the operator is used (possibly
> expensive). But, if you have a shared perl library, this only has to
> happen once.

Hmmm, I really like the perl regex's, especially the extended syntax, but I don't want to load a whole perl lib to get this.

-dg

David Gould dg@illustra.com 510.628.3783 or 510.305.9468
Informix Software (No, really) 300 Lakeside Drive Oakland, CA 94612
"Of course, someone who knows more about this will correct me if I'm wrong,
and someone who knows less will correct me if I'm right."
--David Palmer (palmer@tybalt.caltech.edu)
Unfortunately, there's no other way. This is mentioned in the perlcall manpage, I believe. One method which is ok in my book is to load the shared perl lib once, in one backend, and then it can be shared between all other backends when they need perl regex's.

There is no mechanism for auto-loading the type/func shared libraries on postmaster startup, correct? It happens per backend session? So to do the above you'd have to have one "Dummy" connection which just did a simple regex and then while(1) { sleep(10^32) };

On Sun, 31 May 1998, at 16:46:30, David Gould wrote:

> Hmmm, I really like the perl regex's, especially the extended syntax, but
> I don't want to load a whole perl lib to get this.
>
> -dg
>
> David Gould dg@illustra.com 510.628.3783 or 510.305.9468
> Informix Software (No, really) 300 Lakeside Drive Oakland, CA 94612
> "Of course, someone who knows more about this will correct me if I'm wrong,
> and someone who knows less will correct me if I'm right."
> --David Palmer (palmer@tybalt.caltech.edu)
Not to mention the fact that if perl (or mod_perl) is already running (and you're using a shared libperl), the library is already loaded.

On Sun, 31 May 1998, at 17:23:16, Brett McCormick wrote:

> Unfortunately, there's no other way. This is mentioned in the
> perlcall manpage, I believe. One method which is ok in my book is to
> load the shared perl lib once, in one backend, and then it can be
> shared between all other backends when they need perl regex's.
>
> There is no mechanism for auto-loading the type/func shared libraries
> on postmaster startup, correct? It happens per backend session? So
> to do the above you'd have to have one "Dummy" connection which just
> did a simple regex and then while(1) { sleep(10^32) };
>
> On Sun, 31 May 1998, at 16:46:30, David Gould wrote:
>
> > Hmmm, I really like the perl regex's, especially the extended syntax, but
> > I don't want to load a whole perl lib to get this.
> >
> > -dg
> >
> > David Gould dg@illustra.com 510.628.3783 or 510.305.9468
> > Informix Software (No, really) 300 Lakeside Drive Oakland, CA 94612
> > "Of course, someone who knows more about this will correct me if I'm wrong,
> > and someone who knows less will correct me if I'm right."
> > --David Palmer (palmer@tybalt.caltech.edu)
> Date: Sun, 31 May 1998 18:56:29 -0700 (PDT)
> From: Brett McCormick <brett@work.chicken.org>
> Sender: owner-pgsql-hackers@hub.org

> Not to mention the fact that if perl (or mod_perl) is already running
> (and you're using a shared libperl), the library is already loaded.

If you're running Apache, mod_perl or not, isn't Posix regex loaded? (HSREGEX or compatible?)

> On Sun, 31 May 1998, at 17:23:16, Brett McCormick wrote:
>
> > Unfortunately, there's no other way. This is mentioned in the
> > perlcall manpage, I believe. One method which is ok in my book is to
> > load the shared perl lib once, in one backend, and then it can be
> > shared between all other backends when they need perl regex's.
> >
> > There is no mechanism for auto-loading the type/func shared libraries
> > on postmaster startup, correct? It happens per backend session? So
> > to do the above you'd have to have one "Dummy" connection which just
> > did a simple regex and then while(1) { sleep(10^32) };

...
> > Not to mention the fact that if perl (or mod_perl) is already running > (and you're using a shared libperl), the library is already loaded. Ok, my vote is to build regexes into the pgsql binary or into a .so that we distribute. There should be no need to have perl installed on a system to run postgresql. If we are going to extend the language to improve on the very lame sql92 like clause, we need to have it be part of the system that can be counted on, not something you might or might not have depending on what else is installed. -dg David Gould dg@illustra.com 510.628.3783 or 510.305.9468 Informix Software 300 Lakeside Drive Oakland, CA 94612 - A child of five could understand this! Fetch me a child of five.
On Sun, 31 May 1998, Thomas G. Lockhart wrote:

> > I've noticed there are no less than 10^10 regex implementations.
> > Is there a standard? Does ANSI have a regexp standard, or is there
> > a regex standard in the ANSI SQL spec? What do we use?
>
> afaik the only regex in ANSI SQL is that implemented for the LIKE
> operator. Pretty pathetic: uses "%" for match-all and "_" for match-any
> and that's it. Ingres had a bit more, with bracketed character ranges
> also. None as rich as what we already have in the backend of Postgres.
>
> Don't know about any other ANSI standards for regex, but I don't know
> that there isn't one either...
> -

SQL3 SIMILAR condition.

SIMILAR is intended for character string pattern matching. The difference between SIMILAR and LIKE is that SIMILAR supports a much more extensive range of possibilities ("wild cards," etc.) than LIKE does. Here is the syntax:

    expression [ NOT ] SIMILAR TO pattern [ ESCAPE escape ]

Jose'
> > > > > Not to mention the fact that if perl (or mod_perl) is already running > > (and you're using a shared libperl), the library is already loaded. > > Ok, my vote is to build regexes into the pgsql binary or into a .so that > we distribute. There should be no need to have perl installed on a system > to run postgresql. If we are going to extend the language to improve on > the very lame sql92 like clause, we need to have it be part of the system > that can be counted on, not something you might or might not have depending > on what else is installed. We already have it as ~, just not with Perl extensions. Our implementation is very slow, and the author has said he is working on a rewrite, though no time frame was given. -- Bruce Momjian | 830 Blythe Avenue maillist@candle.pha.pa.us | Drexel Hill, Pennsylvania 19026 + If your life is a hard drive, | (610) 353-9879(w) + Christ can be your backup. | (610) 853-3000(h)
On Mon, 1 June 1998, at 10:16:35, Bruce Momjian wrote: > > Ok, my vote is to build regexes into the pgsql binary or into a .so that > > we distribute. There should be no need to have perl installed on a system > > to run postgresql. If we are going to extend the language to improve on > > the very lame sql92 like clause, we need to have it be part of the system > > that can be counted on, not something you might or might not have depending > > on what else is installed. I'm not suggesting we require perl to be installed to run postgres, or replace the current regexp implementation with perl. i was just lamenting the fact that there are no less than 10 different regexp implementations, with different metacharacters. why should I have to remember one syntax when I use perl, one for sed, one for emacs, and another for postgresql? this isn't a problem with postgres per se, just the fact that there seems to be no standard. I love perl regex's. I'm merely suggesting (and planning on implementing) a different set of regexp operators (not included with postgres, but as a contrib module) that use perl regex's. There are some pros and cons, which have been discussed. It should be there for people who want it. > > We already have it as ~, just not with Perl extensions. Our > implementation is very slow, and the author has said he is working on a > rewrite, though no time frame was given.
On Sun, 31 May 1998, David Gould wrote: > > > > Not to mention the fact that if perl (or mod_perl) is already running > > (and you're using a shared libperl), the library is already loaded. > > Ok, my vote is to build regexes into the pgsql binary or into a .so that > we distribute. There should be no need to have perl installed on a system > to run postgresql. If we are going to extend the language to improve on > the very lame sql92 like clause, we need to have it be part of the system > that can be counted on, not something you might or might not have depending > on what else is installed. Odd question here, but how many systems nowadays *don't* have Perl installed that would be running PostgreSQL? IMHO, perl is an invaluable enough tool that I can't imagine a site not running it *shrug*
Brett McCormick wrote: > > On Mon, 1 June 1998, at 10:16:35, Bruce Momjian wrote: > > > > Ok, my vote is to build regexes into the pgsql binary or into a .so that > > > we distribute. There should be no need to have perl installed on a system > > > to run postgresql. If we are going to extend the language to improve on > > > the very lame sql92 like clause, we need to have it be part of the system > > > that can be counted on, not something you might or might not have depending > > > on what else is installed. > > I'm not suggesting we require perl to be installed to run postgres, or > replace the current regexp implementation with perl. i was just > lamenting the fact that there are no less than 10 different regexp > implementations, with different metacharacters. why should I have to > remember one syntax when I use perl, one for sed, one for emacs, and > another for postgresql? this isn't a problem with postgres per se, > just the fact that there seems to be no standard. I think most of this is due to different decisions on what needs to be escaped or not. For instance, if memory serves, GNU grep treats parens as metacharacters, which must be escaped with a backslash to match parens, while in Emacs, parens match parens and must be escaped to get their meta-character meaning. Things have gone too far to have one standard now I'm afraid. Ocie
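The escaping difference described here can be shown with a small sketch that converts a pattern from one convention to the other by swapping escaped and unescaped parentheses. The function name swap_paren_escaping is hypothetical, and the sketch assumes parentheses are the only point of difference; it does not handle parentheses inside bracket expressions such as [(] or any other metacharacter that differs between dialects.

```python
import re


def swap_paren_escaping(pattern):
    """Swap the roles of "(" / ")" and "\\(" / "\\)" in a pattern.

    A pattern written for a dialect where grouping is spelled "\\(...\\)"
    becomes one where grouping is spelled "(...)", and vice versa.
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt in "()":
                out.append(nxt)        # escaped paren becomes a bare paren
            else:
                out.append(ch + nxt)   # any other escape passes through
            i += 2
        elif ch in "()":
            out.append("\\" + ch)      # bare paren becomes an escaped paren
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# Grouping written with backslashes, literal parens written bare.
backslash_style = r"\(ab\)*(c)"
# The same pattern in the convention Python's re module uses.
bare_style = swap_paren_escaping(backslash_style)
```

Applying the function twice returns the original pattern, which makes it a convenient round-trip check when moving a pattern between tools.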
-----BEGIN PGP SIGNED MESSAGE----- >>>>> "ocie" == ocie <ocie@paracel.com> writes: ocie> I think most of this is due to different decisions on what ocie> needs to be escaped or not. For instance, if memory serves, ocie> GNU grep treats parens as metacharacters, which must be ocie> escaped with a backslash to match parens, while in Emacs, ocie> parens match parens and must be escaped to get their ocie> meta-character meaning. Things have gone too far to have ocie> one standard now I'm afraid. Please try to remember that there are historical reasons for some of this. grep and egrep behave differently with respect to parentheses; again, this is historical. Personally, I like Perl regexps. And there is a library for Tcl/Tk (nre) that implements the same syntax for that language. But I do like Emacs' syntax tables and character classes. I can live with switching back and forth to some extent.... roland -----BEGIN PGP SIGNATURE----- Version: 2.6.2 Comment: Processed by Mailcrypt 3.4, an Emacs/PGP interface iQCVAwUBNXSyLuoW38lmvDvNAQHatQQAsyp+akdXl0TiptXsSlrp7tM2/Jb/jLnW SfpkYVkk53iER/JMYMU4trfQQssePkqGmaF8GMeU5i8eMW6Vi3Vus2pqovnLa1eV w5rCgxKXqpZnIhGJZeHIYieMfWxfdmWOUjawrjKv85vBRdZDYdRkLBoAWvI4ZaJb JxAEwqbZrQw= =Zgvo -----END PGP SIGNATURE----- -- Roland B. Roberts, PhD Custom Software Solutions roberts@panix.com 101 West 15th St #4NN New York, NY 10011
Roland B. Roberts, PhD writes:

> >>>>> "ocie" == ocie <ocie@paracel.com> writes:
>
> ocie> I think most of this is due to different decisions on what
> ocie> needs to be escaped or not. For instance, if memory serves,
> ocie> GNU grep treats parens as metacharacters, which must be
> ocie> escaped with a backslash to match parens, while in Emacs,
> ocie> parens match parens and must be escaped to get their
> ocie> meta-character meaning. Things have gone too far to have
> ocie> one standard now I'm afraid.
>
> Please try to remember that there are historical reasons for some of
> this. grep and egrep behave differently with respect to parentheses;
> again, this is historical.
>
> Personally, I like Perl regexps. And there is a library for Tcl/Tk
> (nre) that implements the same syntax for that language. But I do
> like Emacs' syntax tables and character classes. I can live with
> switching back and forth to some extent....

Emacs! Huh! I like VI regexes...

Uh oh, sorry, wrong flamewar.

Isn't there a POSIX regex? Perhaps we could consider that, unless of course it is well and truly broken.

Secondly, I seem to remember a post here in this same thread that said we already had regexes. Perhaps we should move on.

Seriously, as part of a Perl extension to postgresql, perl regexes would be the natural thing. But if we already have a regex package, I think adding just perl regexes without perl, but requiring perl.so, is uhmmm, premature.

-dg

David Gould dg@illustra.com 510.628.3783 or 510.305.9468
Informix Software (No, really) 300 Lakeside Drive Oakland, CA 94612
"Don't worry about people stealing your ideas. If your ideas are any
good, you'll have to ram them down people's throats." -- Howard Aiken
> I've noticed there are no less than 10^10 regex implementations.
> Is there a standard? Does ANSI have a regexp standard, or is there
> a regex standard in the ANSI SQL spec? What do we use?
>
> Personally, I'm a perl guy, so every time I have to bend my brain to
> some other regex syntax, I get a headache. As part of my perl PL
> package, perl regexps will be included as a set of operators.
>
> Is there interest in the release of perl-style regexp operators for
> postgres before the PL is completed? Note that this requires the
> entire perl library to be loaded when the operator is used (possibly
> expensive). But, if you have a shared perl library, this only has to
> happen once.

Well, not to bring this up for discussion again, but there is apparently a Posix standard, and even better a free implementation:

Article 10705 of comp.os.linux.misc:
Newsgroups: gnu.announce,gnu.utils.bug,comp.os.linux.misc,alt.sources.d
Subject: Rx 1.9
Date: Wed, 10 Jun 1998 10:40:00 -0700 (PDT)
Approved: info-gnu@gnu.org

The latest version of Rx, 1.9, is available on the web at:

    http://users.lanminds.com/~lord
    ftp://emf.net/users/lord/src/rx-1.9.tar.gz

and at ftp://ftp.gnu.org/pub/gnu/rx-1.9.tar.gz and mirrors of that site (see list below).

Rx is a regexp pattern matching library. The library exports these functions which are standardized by Posix:

    regcomp  - compile a regexp
    regexec  - search for a match
    regfree  - release storage for a regexp
    regerror - translate error codes to strings

The library exports many other functions as well, and does a lot more than Posix requires.

RECENT CHANGES

1. Rx 1.9

Recent changes: More "dead code" was recently discarded, and the remaining code simplified. Benchmark comparisons to GNU regex and older versions of Rx were added to the distribution.

0. Rx 1.8

Recent changes: Various bug-fixes and small performance improvements.
A great deal of "dead code" was recently discarded, making the size of the Rx library smaller and the source easier to maintain (in theory).

[... GNU ordering information and list of mirrored ftp sites trimmed ...]

-dg

David Gould dg@illustra.com 510.628.3783 or 510.305.9468
Informix Software (No, really) 300 Lakeside Drive Oakland, CA 94612
"Don't worry about people stealing your ideas. If your ideas are any
good, you'll have to ram them down people's throats." -- Howard Aiken
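The announcement describes a compile, search, release, and error-reporting cycle (regcomp, regexec and friends). As a rough analogy only, the same cycle can be sketched with Python's re module; this is not the Rx or POSIX C API, the helper names compile_pattern and find_all are hypothetical, and the pattern syntax is Python's rather than POSIX's.

```python
import re


def compile_pattern(pattern):
    """Compile once, in the spirit of regcomp.

    Returns (compiled, None) on success or (None, message) when the
    pattern has a syntax error, which stands in for the error-string step.
    """
    try:
        return re.compile(pattern), None
    except re.error as exc:
        return None, str(exc)


def find_all(compiled, lines):
    """Search many inputs with one compiled pattern, in the spirit of regexec."""
    return [line for line in lines if compiled.search(line)]


compiled, error = compile_pattern(r"reg(comp|exec)")
if compiled is not None:
    hits = find_all(compiled, ["regcomp", "regfree", "call regexec here"])
# There is no explicit release step: Python frees the compiled pattern
# when it is garbage collected.
```

The point of the split is that compilation is the expensive step, so a backend would want to compile a pattern once and reuse it across rows.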
On Thu, 11 Jun 1998, David Gould wrote:

> Article 10705 of comp.os.linux.misc:
> Newsgroups: gnu.announce,gnu.utils.bug,comp.os.linux.misc,alt.sources.d
> Subject: Rx 1.9
> Date: Wed, 10 Jun 1998 10:40:00 -0700 (PDT)
> Approved: info-gnu@gnu.org
>
> The latest version of Rx, 1.9, is available on the web at:
>
> http://users.lanminds.com/~lord
> ftp://emf.net/users/lord/src/rx-1.9.tar.gz
> and at ftp://ftp.gnu.org/pub/gnu/rx-1.9.tar.gz and mirrors of that
> site (see list below).

The reason that we do not use this particular Regex package is that *it* falls under the "Almighty GPL", which conflicts with our Berkeley Copyright...

Now, if there is a standardized spec on this, though, what would it take to change our Regex to follow it, *without* the risk of tainting our code with GPLd code?

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org