Re: Regular Expression INTRO [long] [OT] - Mailing list pgsql-general
From | Bruce Momjian |
---|---|
Subject | Re: Regular Expression INTRO [long] [OT] |
Date | |
Msg-id | 200208260035.g7Q0ZTr25111@candle.pha.pa.us Whole thread Raw |
In response to | Regular Expression INTRO [long] [OT] (will trillich <will@serensoft.com>) |
List | pgsql-general |
Justin, have added this to techdocs? I have a followup email here with some clarifications. --------------------------------------------------------------------------- will trillich wrote: > wow -- this was nearly a year ago and i still haven't > done it! > > Justin Clift wrote: > > Would you mind taking a few moments and writing a guide on using Regex > > functions in PostgreSQL? > > > > This is to put on the techdocs.postgresql.org website so everyone in the > > PostgreSQL community will have access to it. > > i'm posting to pgsql-general so the community > can help find holes and plug them... > > here goes -- > > > > > > > Finding data in PostgreSQL using LIKE or ~ (REGULAR EXPRESSIONS): > > > LIKE > ==== > > You're probably familiar with the WHERE clause > of the SELECT statement -- > > SELECT somefield FROM table WHERE somefield = somevalue; > > For example, to show all customers within the > post code "60201" you could try > > SELECT * FROM cust WHERE zip = '60201'; > > But if you also have nine-digit zips stored in the zip > field (in the U.S.A. we have standard five-digit postal > codes and also a four-digit extension for further > geographic resolution) then this search will OMIT > customers having more-specific zips such as '60201-9876' > or '60201-1234'. > > > The Percent "%" > --------------- > > Try using LIKE instead of = (EQUALS): > > SELECT * FROM cust WHERE zip LIKE '60201%'; > > With LIKE, your search values can use the percent (%) > to tell PostgreSQL that anything can occupy that spot > -- one character, twelve symbols or zero digits -- and > still satisfy the search. > > On Unix or Linux, this is basically the same as the > ASTERISK (*) at the command line, when dealing with > file names: > > # list all files whose names begin with '.bash' > ls .bash* > > # list all files containing 'out' anywhere in the name > ls *out* > > # list all file names ending with '.pl' > ls *.pl > > # list file starting with 'proj', ending with '.c' > ls proj*.c > > With PostgreSQL using the LIKE operator, use the > percent, instead: > > -- list all customers within the 47610 postal code > SELECT * FROM cust WHERE zip LIKE '47610%'; > > -- display customers who have 'Corp' in their names > SELECT * FROM cust WHERE name LIKE '%Corp%'; > > -- show customers whose names end in 'LLC' > SELECT * FROM cust WHERE name LIKE '%LLC'; > > -- documents beginning with 'We', ending with 'rica' > SELECT * FROM doc WHERE contents LIKE 'We%rica'; > > Wherever the '%' appears (using the LIKE operator) > Postgres allows anything at all to match -- from a > lengthy string of text, to one single character, to > a zero-length string -- i.e. nothing at all. > > ...ILIKE 'A%Z' > -- matches 'AZ' and 'A to Z' and 'ABC2468XYZ' and 'A@$*Z' > > > The Underscore "_" > ------------------ > > For the LIKE operator, the UNDERSCORE (_) takes on the same > meaning as the question mark does to shell operations for > Unix and Linux file names: > > # list files starting with dot, followed by at least two chars > ls .??* > > The underscore matches just one character exactly: > > SELECT * FROM atable WHERE afield LIKE '_'; > -- shows records where afield is exactly one character > -- omitting ones where it's blank or has two (or more) > -- characters in it) > > You can restrict your searches to finding fields > of certain lengths this way: > > SELECT * FROM cust WHERE zip LIKE '_____' OR zip LIKE '_________'; > -- display all five- or nine-character zip codes > > Combine it with percent to find fields over a certain length: > > SELECT * FROM atable WHERE afield LIKE '___%'; > -- display records where afield has three or more characters > > Or you can accomodate some language idiosyncracies, as > well: > > SELECT * FROM activity WHERE venue LIKE 'Theat__'; > -- find u.k. 'theatre' and u.s. 'theater' > > > ILIKE > ===== > > But what if a customer's name is all capitals, as in > 'THE ACME CORPORATION'? The "...LIKE '%Corp%'" won't > find it, because uppercase 'O' is not identical to > lowercase 'o'! > > Finding text independent of uppercase/lowercase > is something we often need to do. And so we have > the ILIKE operator. > > Use ILIKE instead of LIKE when your search should > ignore case -- the "I" stands for case-[I]nsensitive: > > -- show all customers with 'corp' in the name > SELECT * FROM cust WHERE name ILIKE '%Corp%'; > -- ignoring whether the field is upper- or lowercase > > That'll find 'Bubba Gump Shrimp Corp' and 'ACME CORP' > and even 'Amalgamated Switch and Relay corporation' > because anything can ('%') precede or ('%') follow the > string 'Corp' which can be uppercase or lowercase. > > But there's even more power in "regular expressions"! > See below... > > > Notes > ----- > > Always include some real data to search for! > > -- show everything, doing a lot of unnecessary work: > SELECT * FROM atable WHERE afield LIKE '%'; > > It would be rather silly to make PostgreSQL compare > every record to see if it matched 'anything goes, > in "afield"'. Just say no. > > Convert your spoken-language request to a LIKE > clause like this: > > "afield ENDS WITH avalue" > reword => "afield starts with anything, ending with avalue" > sql => "... WHERE afield LIKE '%avalue'" > > "afield STARTS WITH avalue" > reword => "afield starts with avalue, ending with anything" > sql => "... WHERE afield LIKE 'avalue%'" > > See how that works? > > "afield CONTAINS avalue" > sql => "... WHERE afield LIKE '%avalue%'" > > If you're looking for something at the beginning of > a field, that's where your data goes, and you END with > the percent. If you're looking for something at the end > of a field, put your data there and allow anything at > the front by putting the percent there. > > ALSO -- if your field is indexed, you'll defeat the index > unless you anchor your search to the beginning of the field. > > -- find fields starting with 'something' > ... afield LIKE 'something%'; -- uses afield's index > > The example below isn't anchored at the start of the field, > so it does no good to use the index: > > -- find fields containing 'pooh' > ... afield LIKE '%pooh%'; -- can't use afield's index > > Afield must contain 'pooh' BUT it could be anywhere in > the field. The index would be useless. > > > LIMITATION > ========== > > Okay. Let's say you have a full-name field that has > first, middle, and last name all in the one field. > Sometimes you'll have a middle name, sometimes > you won't. Sometimes the first name will be spelled > out, other times it'll be shortened or even abbreviated > to just the first initial. > > Here's how you'd search that field for "Abraham Lincoln" > using the LIKE operator: > > SELECT * FROM peron WHERE fullname LIKE 'A%Lincoln'; > > It'll find 'Abraham Lincoln' and 'A. Lincoln' and > 'Abe Lincoln' because anything can go between the > initial 'A' and the ending 'Lincoln'. > > But -- it'll also match 'Andrew M. Lincoln' and 'Abner > Sasquatch Lincoln' as well... and this is why some bright > souls created regular expressions! > > > REGULAR EXPRESSIONS > =================== > > A regular expression is a pattern to search for. > The structure of the pattern has to match a > rigid set of rules so that PostgreSQL will know > what you're trying to search for. > > This introduction will barely scratch the surface, > but hopefully it'll get you started-- > > If you've ever tinkered with perl, you've probably > encountered regular expressions: > > perl -ne 'print if /[A-Za-z]/;' somefile > > Here, the "[A-Za-z]" is a regular expression. > It matches any line containing uppercase (A-Z) > or lowercase (a-z) letters. In this example, the > lines will come from the file "somefile" and > lines that match will be printed out. > > Well, PostgreSQL can do that, too! > > > Example > ------- > > SELECT * FROM person > WHERE fullname ~ 'A(be|braham|\\.) *Lincoln'; > > The REGULAR EXPRESSION operator is the TILDE (~). > The value you're searching for then must be a valid > regular expression. In the above example, we're asking > PostgreSQL to display all rows containing > > 'A' > > followed by EITHER > > 'be' > OR > 'braham' > OR > '\\.' -- a dot > > which is then followed by > > ' ' -- a space > '*' -- zero or more times > > followed by > > 'Lincoln' > > and that's all. Whew! > > As you can guess, these will all successfully match: > > 'A. Lincoln' > 'AbeLincoln' -- without any space at all > 'Abraham Lincoln' -- with lots of spaces > > and these won't match: > > 'A Lincoln' -- no dot, no 'braham', no 'be' > 'Abe Gump Lincoln' -- nothing will match 'Gump' > > > Quoting "\\" > ------------ > > So what's with the BACKSLASH in front of the dot? Well, > just as LIKE has the UNDERSCORE (_) to denote "any single > character", REGULAR EXPRESSIONS use the DOT (.) for > that very same purpose. So we have to "escape" the > dot to alter its normal interpretation, using the. > > Note that string literals in PostgreSQL already use the > backslash, so you'll have to double any backslash > you use: > > psql=# SELECT * FROM person > psql-# WHERE fullname ~ 'A(be|braham|\\.) *Lincoln' > psql-# -- a second backslash added > psql-# ; > > > Grouping "()" > ------------- > > How about the PARENTHESES () and the BAR |, hmm? Well, > that's how you can group permitted alternatives: > > ... ~ 'A(be|braham|\\.)' > > That says that 'A' can be followed by any of the three > sub-expressions within the parentheses, namely 'be' or > 'braham' or <dot>. > > These all produce the same matches: > > ... ~ 'A(be|braham|\\.)' > ... ~ 'Abe|Abraham|A\\.' > ... ~ 'A(b(e|raham)|\\.)' > > Let's dissect this last one -- it specifies that: > > 'A' > > can be followed by either > > 'b' followed by some more stuff > or > <dot> > > nestling into the details following the 'b' -- if > 'A' is followed by 'b', then the 'b' must be followed > by either > > 'e' -- which matches 'Abe' > or > 'raham' -- which matches 'Abraham' > > Do you see how powerful this can be? > > > Character Classes "[]" > ---------------------- > > So you're trying to find "McAndrews" -- but maybe it's > spelled with a lower-case "a" as in "Mcandrews", instead! > > SELECT * FROM person > WHERE fullname ~ 'Mc[Aa]ndrews'; > > Using the BRACKETS tells the regular expression > parser that you're allowing any one of a whole class > of characters in that spot. > > '[Aa]' > > It specifies that you're looking for either an uppercase > 'A' or a lowercase 'a' in that spot. > > Here's how you can find fields containing vowels: > > ... ~ '[AEIOUaeiou]' > > Search for fields containing lowercase letters: > > ... ~ '[abcdefghijklmnopqrstuvwxyz]' > > Of course, there's a shortcut for specifying character > classes that cover a whole range: > > ... ~ '[a-z]' -- also matches any lowercase letter > > Display fields that contain digits: > > ... ~ '[0-9]' > > Here's an example using a standard U.S. phone pattern > (neglecting the area code for clarity and space): > > SELECT * FROM cust > WHERE descr ~ '[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]'; > > That will show customers whose "descr" field contains a > pattern of digits that looks like a U.S. phone number: > > '[0-9][0-9][0-9]' -- three digits > > followed by > > '-' -- a hyphen > > followed by > > '[0-9][0-9][0-9][0-9]' -- four digits > > As you can see, the hyphen doesn't do anything special > for a regular expression unless it's inside the square > brackets of a character class -- in which case it > means "anything between". > > What if you want to allow a hyphen within a character class? > Simply make it the first character inside the brackets: > > '[-.,_]' > > That class will match any hyphen, dot, comma or underscore. > > > Bounds "{}" > ----------- > > Rather than having to specify each of those [0-9] > monstrosities every time, let's abbreviate. Instead > of: > > '[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]' > > we can just say > > '[0-9]{3}-[0-9]{4}'. > > The BOUNDS specification, inside braces, comes immediately > after whatever you want a certain number of. That is, > first you specify what you're looking for, then you specify > how many you need. (The default is {1,1} one, as you can tell: > one 'A' followed by one 'b' followed by...) > > Inside the BRACES that specify your "bounds" you can > say you want exactly so many, as we did above (3 and > then 4) or you can give a low-to-high pair: > > ... ~ 'Z{3}' -- need three Z's > ... ~ '@{5,}' -- five or more @'s > ... ~ 'Q{2,5}' -- at least two, at most five, Q's > ... ~ '(This|That){0,3}' > > The last one allows for 'This' or 'That', anywhere from > zero to three times. So 'ThisThatThis' and '' will match! > Convenient, hmm? > > And for very common cases, there are handy abbreviations: > > 'x{1,1}' 'x' one (the default) > 'r{0,1}' 'r?' zero or one, i.e. 'x' is optional > 'B{0,}' 'B*' zero or more > 'z{1,}' 'z+' one or more > > These three abbreviations will pop up often: > > -- allow 'Ms' and 'Mrs' (the 'r' is optional): > ... title ~ '(Mr|Mr?s|Dr)' > > -- maybe there's a space, maybe there isn't: > ... lname ~ 'Mac ?Affee' > > -- don't care how many trailing spaces: > ... lname ~ 'Smith *' > > -- find records with two or more even digits: > ... afield ~ '[02468]{2,}' > or > ... afield ~ '[02468][02468]+' > > That last one breaks down to > > '[02468]' -- an even digit > > followed by > > '[02468]' -- an even digit > '+' -- one or more times > > Remember: the plus means '{1,}' -- i.e. that the preceeding > item must match one or more times. > > Getting back to our phone number pattern... this is > the same request as above, but it's easier to > see what's going on, using the {bounds} feature: > > SELECT * FROM cust > WHERE descr ~ '[0-9]{3}-[0-9]{4}'; > -- three digits, hyphen, four digits > > > Excluding Characters "[^]" > -------------------------- > > Note that the above pattern will also match 00000-0000 > because the last three digits of the five in the left > chunk, with the four digits of the right chunk, fit > perfectly into the constraints of what we asked for. > And this ain't no phone number. > > Here's an approach to fixing that snag: > > SELECT * FROM cust > WHERE descr ~ '[^0-9][0-9]{3}-[0-9]{4}'; > > Note that we've prefixed the previous pattern with > > '[^0-9]' > > because, within the square brackets of a character class, > the CARAT (^) means "anything EXCEPT..." > > So now, 00000-0000 will NOT match; this is what we're > after. > > But there's a new problem: if the phone number is the > very first thing in the "descr" field it would never > match because we're demanding that there be SOMETHING > (besides a digit) before the phone number. What to do? > > > At The Very Beginning "^" > ------------------------- > > You can match "beginning-of-field" with the CARAT (^). > (When it's the first thing inside square brackets, it > negates the character class; outside brackets, a carat > means "beginning-of-text". Got that? It's important!) > > SELECT * FROM cust > WHERE descr ~ '(^|[^0-9])[0-9]{3}-[0-9]{4}'; > > Here's what this pattern looks for: > > Either > '^' -- at the very beginning of text > or > '[^0-9]' -- find something that's a non-digit > > followed by > > '[0-9]{3}' -- three digits > > and then a > > '-' -- hyphen > > and finally > > '[0-9]{4}' -- four digits > > and we've solved the 'at-beginning-of-field' problem. > > > At The Very End "$" > ------------------- > > Still, we could inadvertently match something like > 000-000000000 (probably an id number for a replacement > part) which we don't want to do. > > Note how this is the same problem, for the end > of the field, that we had earlier, with the beginning > of the field? The solution is similar: > > SELECT * FROM cust > WHERE descr ~ '(^|[^0-9])[0-9]{3}-[0-9]{4}($|[^0-9])'; > > The DOLLAR SIGN ($) represents 'end-of-text' just as > carat, outside of square brackets, means 'beginning-of- > text'. The above pattern is the same as before, except > we appended > > '($|[^0-9])' > > to the end, which means we're looking for > > either > > '$' -- at the very end-of-text > or > '[^0-9]' -- something that's not a digit > > after our previous pattern. > > And, finally, we have a comprehensive phone pattern -- > it'll match ###-#### while excluding other numeric > arrangements -- in a text field. > > > Anchoring > --------- > > Did you notice that regular expressions aren't "anchored", > as the LIKE expressions are? To specify that you're looking > only at the beginning of a field, you need to use '^' and > to include the end of a field, you must use '$'. > > It's a good thing regular expressions aren't anchored > automatically -- we wouldn't have been able to specify > our phone number pattern above! Things like '(^|[^0-9])' > are possible because YOU get to specify when and how > beginning-of-text is required. Cool, eh? > > And, just as in the LIKE operator, unless regular expressions > ARE anchored at the beginning of a field, you'll defeat any > index you have for that field. Indexes help alphabetize by > comparing the beginning of your fields to each other; unless > you're looking at the beginning of the field, your index can't > be used. > > Here are some comparisons between operators LIKE and ~ : > > -- list all customers within the 47610 postal code: > SELECT * FROM cust WHERE zip LIKE '47610%'; > -- zip begins with '47610' then has anything > SELECT * FROM cust WHERE zip ~ '^47610'; > -- zip begins with '47610' > > -- display customers who have 'Corp' in their names > SELECT * FROM cust WHERE name LIKE '%Corp%'; > -- name contains 'Corp' with anything before and after > SELECT * FROM cust WHERE name ~ 'Corp'; > -- name contains 'Corp' > > -- show customers whose names end in 'LLC' > SELECT * FROM cust WHERE name LIKE '%LLC'; > -- name can have anything, with 'LLC' at the end > SELECT * FROM cust WHERE name ~ 'LLC$'; > -- name must have 'LLC' at the end > > -- documents beginning with 'We', ending with 'rica' > SELECT * FROM doc WHERE contents LIKE 'We%rica'; > -- starts with 'We', has anything, ending with 'rica' > SELECT * FROM doc WHERE contents ~ '^We.*rica$'; > -- start with 'We', zero or more chars, end with 'rica' > > Remember that in regular expressions, DOT means 'any character'. > Thus '.*' means 'any character, zero times or more' which, in > English, means "anything can go here, including nothing at all". > > There are ways to anchor your searches to word boundaries, > as well -- not just beginning-of-field and end-of-field. See > your documentation for details. > > > Case-Insensitive "~*" > --------------------- > > If you're not worried about differentiating between > uppercase and lowercase in your regular expressions, > you could go full boar and try > > SELECT * FROM cust > WHERE lname ~ '[Oo][Cc][Tt][Aa][Vv][Ii][Aa][Nn]'; > > Fortunately there's a quickie to make it easier for you -- > where you've been using the tilde (~) as your operator, use > tilde-star (~*) instead: > > SELECT * FROM cust > WHERE lname ~* 'Octavian'; > > ILIKE is case-insensitive (where LIKE is case-specific) just > as ~* is case-insensitive for regular expressions (where ~ is > case-specific). Very handy! > > > Example > ------- > > Email addresses can look like any of these: > > me@my.net > someone-unimportant9237@this.little.org > first.last.title@obscure.sub-net.biggie.com > _weird_@somewhere.out_there.net > > A reasonably-functional email address pattern might be > something like this: > > '[a-z0-9_]+([\\-\\.][a-z0-9_]+)*@[a-z0-9_]+([\\-\\.][a-z0-9_]+)+' > > That breaks down to (using case-indifferent via ~*, of course): > > '[a-z0-9_]' > > any alphanumeric (including underscore) character > > '+' > > one or more times, followed by > > '([\\-\\.][a-z0-9_]+)' > > email stuff [1], > > '*' > > zero or more times, followed by > > '@' > > which is then followed by > > '[a-z0-9_]' > > alphanumerics (including underscore) > > '+' > > one or more times, followed by > > '([\\-\\.][a-z0-9_]+)' > > site stuff [2] > > '+' > > one or more times. > > > As for [1] email stuff, it can be (zero or more of): > > '[\\-\\.]' > > hyphen or dot, followed by > > '[a-z0-9_]' > > alphanumerics (or underscore) > > '+' > > one or more times. > > > And as for [2] site stuff, it can be (one or more of): > > '[\\-\\.]' > > hyphen or dot, followed by > > '[a-z0-9_]' > > alphanumerics (or underscore) > > '+' > > one or more times. > > So the personal part of the address can just be a > single word, with optional dot-or-hyphen joining other > words (ZERO or more) onto it; this is followed by at-sign, > which is followed by the site portion of the address, > namely: any word, followed by ONE or more dot-or-hyphen- > followed-by-another-word combo's. > > It's not a perfect match for all legal email patterns, > (for example, bad@my-addr would match) but something like > it might be sufficient depending on your requirements. > > > CONCLUSION > ========== > > Regular expressions are complicated, because they're > powerful! (Or is it the other way around?) > > Check out "Pattern Matching" in the PostgreSQL manual > for more details -- on my Debian "Potato" system it's in > html/postgres/functions-matching.html under the > /usr/share/doc/postgresql-doc/ directory. > > -- > mailto:will@serensoft.com > http://www.dontUthink.com/ > > ---------------------------(end of broadcast)--------------------------- > TIP 5: Have you checked our extensive FAQ? > > http://www.postgresql.org/users-lounge/docs/faq.html > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
pgsql-general by date: