Re: Regular Expression INTRO [long] [OT] - Mailing list pgsql-general

From Bruce Momjian
Subject Re: Regular Expression INTRO [long] [OT]
Date
Msg-id 200208260035.g7Q0ZTr25111@candle.pha.pa.us
Whole thread Raw
In response to Regular Expression INTRO [long] [OT]  (will trillich <will@serensoft.com>)
List pgsql-general
Justin, have added this to techdocs?  I have a followup email here with
some clarifications.

---------------------------------------------------------------------------

will trillich wrote:
> wow -- this was nearly a year ago and i still haven't
> done it!
>
> Justin Clift wrote:
> > Would you mind taking a few moments and writing a guide on using Regex
> > functions in PostgreSQL?
> >
> > This is to put on the techdocs.postgresql.org website so everyone in the
> > PostgreSQL community will have access to it.
>
> i'm posting to pgsql-general so the community
> can help find holes and plug them...
>
> here goes --
>
>
>
>
>
>
> Finding data in PostgreSQL using LIKE or ~ (REGULAR EXPRESSIONS):
>
>
> LIKE
> ====
>
> You're probably familiar with the WHERE clause
> of the SELECT statement --
>
>     SELECT somefield FROM table WHERE somefield = somevalue;
>
> For example, to show all customers within the
> post code "60201" you could try
>
>     SELECT * FROM cust WHERE zip = '60201';
>
> But if you also have nine-digit zips stored in the zip
> field (in the U.S.A. we have standard five-digit postal
> codes and also a four-digit extension for further
> geographic resolution) then this search will OMIT
> customers having more-specific zips such as '60201-9876'
> or '60201-1234'.
>
>
> The Percent "%"
> ---------------
>
> Try using LIKE instead of = (EQUALS):
>
>     SELECT * FROM cust WHERE zip LIKE '60201%';
>
> With LIKE, your search values can use the percent (%)
> to tell PostgreSQL that anything can occupy that spot
> -- one character, twelve symbols or zero digits -- and
> still satisfy the search.
>
> On Unix or Linux, this is basically the same as the
> ASTERISK (*) at the command line, when dealing with
> file names:
>
>     # list all files whose names begin with '.bash'
>     ls .bash*
>
>     # list all files containing 'out' anywhere in the name
>     ls *out*
>
>     # list all file names ending with '.pl'
>     ls *.pl
>
>     # list file starting with 'proj', ending with '.c'
>     ls proj*.c
>
> With PostgreSQL using the LIKE operator, use the
> percent, instead:
>
>     -- list all customers within the 47610 postal code
>     SELECT * FROM cust WHERE zip LIKE '47610%';
>
>     -- display customers who have 'Corp' in their names
>     SELECT * FROM cust WHERE name LIKE '%Corp%';
>
>     -- show customers whose names end in 'LLC'
>     SELECT * FROM cust WHERE name LIKE '%LLC';
>
>     -- documents beginning with 'We', ending with 'rica'
>     SELECT * FROM doc WHERE contents LIKE 'We%rica';
>
> Wherever the '%' appears (using the LIKE operator)
> Postgres allows anything at all to match -- from a
> lengthy string of text, to one single character, to
> a zero-length string -- i.e. nothing at all.
>
>     ...ILIKE 'A%Z'
>     -- matches 'AZ' and 'A to Z' and 'ABC2468XYZ' and 'A@$*Z'
>
>
> The Underscore "_"
> ------------------
>
> For the LIKE operator, the UNDERSCORE (_) takes on the same
> meaning as the question mark does to shell operations for
> Unix and Linux file names:
>
>     # list files starting with dot, followed by at least two chars
>     ls .??*
>
> The underscore matches just one character exactly:
>
>     SELECT * FROM atable WHERE afield LIKE '_';
>     -- shows records where afield is exactly one character
>     -- omitting ones where it's blank or has two (or more)
>     -- characters in it)
>
> You can restrict your searches to finding fields
> of certain lengths this way:
>
>     SELECT * FROM cust WHERE zip LIKE '_____' OR zip LIKE '_________';
>     -- display all five- or nine-character zip codes
>
> Combine it with percent to find fields over a certain length:
>
>     SELECT * FROM atable WHERE afield LIKE '___%';
>     -- display records where afield has three or more characters
>
> Or you can accomodate some language idiosyncracies, as
> well:
>
>     SELECT * FROM activity WHERE venue LIKE 'Theat__';
>     -- find u.k. 'theatre' and u.s. 'theater'
>
>
> ILIKE
> =====
>
> But what if a customer's name is all capitals, as in
> 'THE ACME CORPORATION'? The "...LIKE '%Corp%'" won't
> find it, because uppercase 'O' is not identical to
> lowercase 'o'!
>
> Finding text independent of uppercase/lowercase
> is something we often need to do. And so we have
> the ILIKE operator.
>
> Use ILIKE instead of LIKE when your search should
> ignore case -- the "I" stands for case-[I]nsensitive:
>
>     -- show all customers with 'corp' in the name
>     SELECT * FROM cust WHERE name ILIKE '%Corp%';
>     -- ignoring whether the field is upper- or lowercase
>
> That'll find 'Bubba Gump Shrimp Corp' and 'ACME CORP'
> and even 'Amalgamated Switch and Relay corporation'
> because anything can ('%') precede or ('%') follow the
> string 'Corp' which can be uppercase or lowercase.
>
> But there's even more power in "regular expressions"!
> See below...
>
>
> Notes
> -----
>
> Always include some real data to search for!
>
>     -- show everything, doing a lot of unnecessary work:
>     SELECT * FROM atable WHERE afield LIKE '%';
>
> It would be rather silly to make PostgreSQL compare
> every record to see if it matched 'anything goes,
> in "afield"'. Just say no.
>
> Convert your spoken-language request to a LIKE
> clause like this:
>
>     "afield ENDS WITH avalue"
>     reword => "afield starts with anything, ending with avalue"
>     sql => "... WHERE afield LIKE '%avalue'"
>
>     "afield STARTS WITH avalue"
>     reword => "afield starts with avalue, ending with anything"
>     sql => "... WHERE afield LIKE 'avalue%'"
>
> See how that works?
>
>     "afield CONTAINS avalue"
>     sql => "... WHERE afield LIKE '%avalue%'"
>
> If you're looking for something at the beginning of
> a field, that's where your data goes, and you END with
> the percent. If you're looking for something at the end
> of a field, put your data there and allow anything at
> the front by putting the percent there.
>
> ALSO -- if your field is indexed, you'll defeat the index
> unless you anchor your search to the beginning of the field.
>
>     -- find fields starting with 'something'
>     ... afield LIKE 'something%'; -- uses afield's index
>
> The example below isn't anchored at the start of the field,
> so it does no good to use the index:
>
>     -- find fields containing 'pooh'
>     ... afield LIKE '%pooh%'; -- can't use afield's index
>
> Afield must contain 'pooh' BUT it could be anywhere in
> the field. The index would be useless.
>
>
> LIMITATION
> ==========
>
> Okay. Let's say you have a full-name field that has
> first, middle, and last name all in the one field.
> Sometimes you'll have a middle name, sometimes
> you won't. Sometimes the first name will be spelled
> out, other times it'll be shortened or even abbreviated
> to just the first initial.
>
> Here's how you'd search that field for "Abraham Lincoln"
> using the LIKE operator:
>
>     SELECT * FROM peron WHERE fullname LIKE 'A%Lincoln';
>
> It'll find 'Abraham Lincoln' and 'A. Lincoln' and
> 'Abe Lincoln' because anything can go between the
> initial 'A' and the ending 'Lincoln'.
>
> But -- it'll also match 'Andrew M. Lincoln' and 'Abner
> Sasquatch Lincoln' as well... and this is why some bright
> souls created regular expressions!
>
>
> REGULAR EXPRESSIONS
> ===================
>
> A regular expression is a pattern to search for.
> The structure of the pattern has to match a
> rigid set of rules so that PostgreSQL will know
> what you're trying to search for.
>
> This introduction will barely scratch the surface,
> but hopefully it'll get you started--
>
> If you've ever tinkered with perl, you've probably
> encountered regular expressions:
>
>     perl -ne 'print if /[A-Za-z]/;' somefile
>
> Here, the "[A-Za-z]" is a regular expression.
> It matches any line containing uppercase (A-Z)
> or lowercase (a-z) letters. In this example, the
> lines will come from the file "somefile" and
> lines that match will be printed out.
>
> Well, PostgreSQL can do that, too!
>
>
> Example
> -------
>
>     SELECT * FROM person
>     WHERE fullname ~ 'A(be|braham|\\.) *Lincoln';
>
> The REGULAR EXPRESSION operator is the TILDE (~).
> The value you're searching for then must be a valid
> regular expression. In the above example, we're asking
> PostgreSQL to display all rows containing
>
>     'A'
>
> followed by EITHER
>
>     'be'
> OR
>     'braham'
> OR
>     '\\.' -- a dot
>
> which is then followed by
>
>     ' ' -- a space
>     '*' -- zero or more times
>
> followed by
>
>     'Lincoln'
>
> and that's all. Whew!
>
> As you can guess, these will all successfully match:
>
>     'A. Lincoln'
>     'AbeLincoln' -- without any space at all
>     'Abraham    Lincoln' -- with lots of spaces
>
> and these won't match:
>
>     'A Lincoln' -- no dot, no 'braham', no 'be'
>     'Abe Gump Lincoln' -- nothing will match 'Gump'
>
>
> Quoting "\\"
> ------------
>
> So what's with the BACKSLASH in front of the dot? Well,
> just as LIKE has the UNDERSCORE (_) to denote "any single
> character", REGULAR EXPRESSIONS use the DOT (.) for
> that very same purpose. So we have to "escape" the
> dot to alter its normal interpretation, using the.
>
> Note that string literals in PostgreSQL already use the
> backslash, so you'll have to double any backslash
> you use:
>
>     psql=# SELECT * FROM person
>     psql-# WHERE fullname ~ 'A(be|braham|\\.) *Lincoln'
>     psql-# -- a second backslash added
>     psql-# ;
>
>
> Grouping "()"
> -------------
>
> How about the PARENTHESES () and the BAR |, hmm? Well,
> that's how you can group permitted alternatives:
>
>     ... ~ 'A(be|braham|\\.)'
>
> That says that 'A' can be followed by any of the three
> sub-expressions within the parentheses, namely 'be' or
> 'braham' or <dot>.
>
> These all produce the same matches:
>
>     ... ~ 'A(be|braham|\\.)'
>     ... ~ 'Abe|Abraham|A\\.'
>     ... ~ 'A(b(e|raham)|\\.)'
>
> Let's dissect this last one -- it specifies that:
>
>     'A'
>
> can be followed by either
>
>     'b' followed by some more stuff
> or
>     <dot>
>
> nestling into the details following the 'b' -- if
> 'A' is followed by 'b', then the 'b' must be followed
> by either
>
>     'e' -- which matches 'Abe'
> or
>     'raham' -- which matches 'Abraham'
>
> Do you see how powerful this can be?
>
>
> Character Classes "[]"
> ----------------------
>
> So you're trying to find "McAndrews" -- but maybe it's
> spelled with a lower-case "a" as in "Mcandrews", instead!
>
>     SELECT * FROM person
>     WHERE fullname ~ 'Mc[Aa]ndrews';
>
> Using the BRACKETS tells the regular expression
> parser that you're allowing any one of a whole class
> of characters in that spot.
>
>     '[Aa]'
>
> It specifies that you're looking for either an uppercase
> 'A' or a lowercase 'a' in that spot.
>
> Here's how you can find fields containing vowels:
>
>     ... ~ '[AEIOUaeiou]'
>
> Search for fields containing lowercase letters:
>
>     ... ~ '[abcdefghijklmnopqrstuvwxyz]'
>
> Of course, there's a shortcut for specifying character
> classes that cover a whole range:
>
>     ... ~ '[a-z]' -- also matches any lowercase letter
>
> Display fields that contain digits:
>
>     ... ~ '[0-9]'
>
> Here's an example using a standard U.S. phone pattern
> (neglecting the area code for clarity and space):
>
>     SELECT * FROM cust
>     WHERE descr ~ '[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]';
>
> That will show customers whose "descr" field contains a
> pattern of digits that looks like a U.S. phone number:
>
>     '[0-9][0-9][0-9]' -- three digits
>
> followed by
>
>     '-' -- a hyphen
>
> followed by
>
>     '[0-9][0-9][0-9][0-9]' -- four digits
>
> As you can see, the hyphen doesn't do anything special
> for a regular expression unless it's inside the square
> brackets of a character class -- in which case it
> means "anything between".
>
> What if you want to allow a hyphen within a character class?
> Simply make it the first character inside the brackets:
>
>     '[-.,_]'
>
> That class will match any hyphen, dot, comma or underscore.
>
>
> Bounds "{}"
> -----------
>
> Rather than having to specify each of those [0-9]
> monstrosities every time, let's abbreviate. Instead
> of:
>
>     '[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]'
>
> we can just say
>
>     '[0-9]{3}-[0-9]{4}'.
>
> The BOUNDS specification, inside braces, comes immediately
> after whatever you want a certain number of. That is,
> first you specify what you're looking for, then you specify
> how many you need. (The default is {1,1} one, as you can tell:
> one 'A' followed by one 'b' followed by...)
>
> Inside the BRACES that specify your "bounds" you can
> say you want exactly so many, as we did above (3 and
> then 4) or you can give a low-to-high pair:
>
>     ... ~ 'Z{3}' -- need three Z's
>     ... ~ '@{5,}' -- five or more @'s
>     ... ~ 'Q{2,5}' -- at least two, at most five, Q's
>     ... ~ '(This|That){0,3}'
>
> The last one allows for 'This' or 'That', anywhere from
> zero to three times. So 'ThisThatThis' and '' will match!
> Convenient, hmm?
>
> And for very common cases, there are handy abbreviations:
>
>     'x{1,1}'  'x'    one (the default)
>     'r{0,1}'  'r?'   zero or one, i.e. 'x' is optional
>     'B{0,}'   'B*'   zero or more
>     'z{1,}'   'z+'   one or more
>
> These three abbreviations will pop up often:
>
>     -- allow 'Ms' and 'Mrs' (the 'r' is optional):
>     ... title ~ '(Mr|Mr?s|Dr)'
>
>     -- maybe there's a space, maybe there isn't:
>     ... lname ~ 'Mac ?Affee'
>
>     -- don't care how many trailing spaces:
>     ... lname ~ 'Smith *'
>
>     -- find records with two or more even digits:
>     ... afield ~ '[02468]{2,}'
> or
>     ... afield ~ '[02468][02468]+'
>
> That last one breaks down to
>
>     '[02468]' -- an even digit
>
> followed by
>
>     '[02468]' -- an even digit
>     '+' -- one or more times
>
> Remember: the plus means '{1,}' -- i.e. that the preceeding
> item must match one or more times.
>
> Getting back to our phone number pattern... this is
> the same request as above, but it's easier to
> see what's going on, using the {bounds} feature:
>
>     SELECT * FROM cust
>     WHERE descr ~ '[0-9]{3}-[0-9]{4}';
>     -- three digits, hyphen, four digits
>
>
> Excluding Characters "[^]"
> --------------------------
>
> Note that the above pattern will also match 00000-0000
> because the last three digits of the five in the left
> chunk, with the four digits of the right chunk, fit
> perfectly into the constraints of what we asked for.
> And this ain't no phone number.
>
> Here's an approach to fixing that snag:
>
>     SELECT * FROM cust
>     WHERE descr ~ '[^0-9][0-9]{3}-[0-9]{4}';
>
> Note that we've prefixed the previous pattern with
>
>     '[^0-9]'
>
> because, within the square brackets of a character class,
> the CARAT (^) means "anything EXCEPT..."
>
> So now, 00000-0000 will NOT match; this is what we're
> after.
>
> But there's a new problem: if the phone number is the
> very first thing in the "descr" field it would never
> match because we're demanding that there be SOMETHING
> (besides a digit) before the phone number. What to do?
>
>
> At The Very Beginning "^"
> -------------------------
>
> You can match "beginning-of-field" with the CARAT (^).
> (When it's the first thing inside square brackets, it
> negates the character class; outside brackets, a carat
> means "beginning-of-text". Got that? It's important!)
>
>     SELECT * FROM cust
>     WHERE descr ~ '(^|[^0-9])[0-9]{3}-[0-9]{4}';
>
> Here's what this pattern looks for:
>
> Either
>     '^' -- at the very beginning of text
> or
>     '[^0-9]' -- find something that's a non-digit
>
> followed by
>
>     '[0-9]{3}' -- three digits
>
> and then a
>
>     '-' -- hyphen
>
> and finally
>
>     '[0-9]{4}' -- four digits
>
> and we've solved the 'at-beginning-of-field' problem.
>
>
> At The Very End "$"
> -------------------
>
> Still, we could inadvertently match something like
> 000-000000000 (probably an id number for a replacement
> part) which we don't want to do.
>
> Note how this is the same problem, for the end
> of the field, that we had earlier, with the beginning
> of the field? The solution is similar:
>
>     SELECT * FROM cust
>     WHERE descr ~ '(^|[^0-9])[0-9]{3}-[0-9]{4}($|[^0-9])';
>
> The DOLLAR SIGN ($) represents 'end-of-text' just as
> carat, outside of square brackets, means 'beginning-of-
> text'. The above pattern is the same as before, except
> we appended
>
>     '($|[^0-9])'
>
> to the end, which means we're looking for
>
> either
>
>     '$' -- at the very end-of-text
> or
>     '[^0-9]' -- something that's not a digit
>
> after our previous pattern.
>
> And, finally, we have a comprehensive phone pattern --
> it'll match ###-#### while excluding other numeric
> arrangements -- in a text field.
>
>
> Anchoring
> ---------
>
> Did you notice that regular expressions aren't "anchored",
> as the LIKE expressions are? To specify that you're looking
> only at the beginning of a field, you need to use '^' and
> to include the end of a field, you must use '$'.
>
> It's a good thing regular expressions aren't anchored
> automatically -- we wouldn't have been able to specify
> our phone number pattern above! Things like '(^|[^0-9])'
> are possible because YOU get to specify when and how
> beginning-of-text is required. Cool, eh?
>
> And, just as in the LIKE operator, unless regular expressions
> ARE anchored at the beginning of a field, you'll defeat any
> index you have for that field. Indexes help alphabetize by
> comparing the beginning of your fields to each other; unless
> you're looking at the beginning of the field, your index can't
> be used.
>
> Here are some comparisons between operators LIKE and ~ :
>
>     -- list all customers within the 47610 postal code:
>     SELECT * FROM cust WHERE zip LIKE '47610%';
>         -- zip begins with '47610' then has anything
>     SELECT * FROM cust WHERE zip ~ '^47610';
>         -- zip begins with '47610'
>
>     -- display customers who have 'Corp' in their names
>     SELECT * FROM cust WHERE name LIKE '%Corp%';
>         -- name contains 'Corp' with anything before and after
>     SELECT * FROM cust WHERE name ~ 'Corp';
>         -- name contains 'Corp'
>
>     -- show customers whose names end in 'LLC'
>     SELECT * FROM cust WHERE name LIKE '%LLC';
>         -- name can have anything, with 'LLC' at the end
>     SELECT * FROM cust WHERE name ~ 'LLC$';
>         -- name must have 'LLC' at the end
>
>     -- documents beginning with 'We', ending with 'rica'
>     SELECT * FROM doc WHERE contents LIKE 'We%rica';
>         -- starts with 'We', has anything, ending with 'rica'
>     SELECT * FROM doc WHERE contents ~ '^We.*rica$';
>         -- start with 'We', zero or more chars, end with 'rica'
>
> Remember that in regular expressions, DOT means 'any character'.
> Thus '.*' means 'any character, zero times or more' which, in
> English, means "anything can go here, including nothing at all".
>
> There are ways to anchor your searches to word boundaries,
> as well -- not just beginning-of-field and end-of-field. See
> your documentation for details.
>
>
> Case-Insensitive "~*"
> ---------------------
>
> If you're not worried about differentiating between
> uppercase and lowercase in your regular expressions,
> you could go full boar and try
>
>     SELECT * FROM cust
>     WHERE lname ~ '[Oo][Cc][Tt][Aa][Vv][Ii][Aa][Nn]';
>
> Fortunately there's a quickie to make it easier for you --
> where you've been using the tilde (~) as your operator, use
> tilde-star (~*) instead:
>
>     SELECT * FROM cust
>     WHERE lname ~* 'Octavian';
>
> ILIKE is case-insensitive (where LIKE is case-specific) just
> as ~* is case-insensitive for regular expressions (where ~ is
> case-specific). Very handy!
>
>
> Example
> -------
>
> Email addresses can look like any of these:
>
>     me@my.net
>     someone-unimportant9237@this.little.org
>     first.last.title@obscure.sub-net.biggie.com
>     _weird_@somewhere.out_there.net
>
> A reasonably-functional email address pattern might be
> something like this:
>
>     '[a-z0-9_]+([\\-\\.][a-z0-9_]+)*@[a-z0-9_]+([\\-\\.][a-z0-9_]+)+'
>
> That breaks down to (using case-indifferent via ~*, of course):
>
>     '[a-z0-9_]'
>
> any alphanumeric (including underscore) character
>
>     '+'
>
> one or more times, followed by
>
>     '([\\-\\.][a-z0-9_]+)'
>
> email stuff [1],
>
>     '*'
>
> zero or more times, followed by
>
>     '@'
>
> which is then followed by
>
>     '[a-z0-9_]'
>
> alphanumerics (including underscore)
>
>     '+'
>
> one or more times, followed by
>
>     '([\\-\\.][a-z0-9_]+)'
>
> site stuff [2]
>
>     '+'
>
> one or more times.
>
>
> As for [1] email stuff, it can be (zero or more of):
>
>     '[\\-\\.]'
>
> hyphen or dot, followed by
>
>     '[a-z0-9_]'
>
> alphanumerics (or underscore)
>
>     '+'
>
> one or more times.
>
>
> And as for [2] site stuff, it can be (one or more of):
>
>     '[\\-\\.]'
>
> hyphen or dot, followed by
>
>     '[a-z0-9_]'
>
> alphanumerics (or underscore)
>
>     '+'
>
> one or more times.
>
> So the personal part of the address can just be a
> single word, with optional dot-or-hyphen joining other
> words (ZERO or more) onto it; this is followed by at-sign,
> which is followed by the site portion of the address,
> namely: any word, followed by ONE or more dot-or-hyphen-
> followed-by-another-word combo's.
>
> It's not a perfect match for all legal email patterns,
> (for example, bad@my-addr would match) but something like
> it might be sufficient depending on your requirements.
>
>
> CONCLUSION
> ==========
>
> Regular expressions are complicated, because they're
> powerful! (Or is it the other way around?)
>
> Check out "Pattern Matching" in the PostgreSQL manual
> for more details -- on my Debian "Potato" system it's in
> html/postgres/functions-matching.html under the
> /usr/share/doc/postgresql-doc/ directory.
>
> --
> mailto:will@serensoft.com
> http://www.dontUthink.com/
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/users-lounge/docs/faq.html
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

pgsql-general by date:

Previous
From:
Date:
Subject: Re: Controling Rule's Firing Order
Next
From: Bruce Momjian
Date:
Subject: Re: Solved! MacOS X and external functions