Home > mailing lists
Re: regex_fixed_prefix() is still a few bricks shy of a load - Mailing list pgsql-hackers

From	Tom Lane
Subject	Re: regex_fixed_prefix() is still a few bricks shy of a load
Date	July 8, 2012 19:39:53
Msg-id	17440.1341787170@sss.pgh.pa.us Whole thread Raw
In response to	Re: regex_fixed_prefix() is still a few bricks shy of a load (Robert Haas <robertmhaas@gmail.com>)
List	pgsql-hackers
Tree view
Robert Haas <robertmhaas@gmail.com> writes:
> On Jul 7, 2012, at 1:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> 3. Try another approach entirely.  The idea that I've got in mind here
>> is to compile the regex using the regex library and then look at the
>> compiled NFA representation to see if there must be a fixed prefix.

> I think this is clearly the best way forward, probably even in the
> back branches.  It's true that the wchar to mb conversion is largely
> untested, but it's also pretty simple code.  Sure, it could have bugs,
> but so could whatever work-around you cobble together to avoid
> back-patching it.  And it's not like we'll break anything else,
> either: the code will only be used in the case that is buggy right now
> anyway.

Attached is a draft patch for that.  I'm fairly happy with the way this
turned out.  The code is a bit longer than before, but most of the net
addition is boilerplate code associated with maintaining a well-defined
API for the regex library proper and the regex caching code in
utils/adt/regexp.c.  The code that actually extracts the prefix string
(findprefix()) is about 150 lines, comparable to the net removal from
regex_fixed_prefix(), and it is *way* less heuristic: basically, given
the data structure it's working with, there is just one right answer.

One thing that remains slightly klugy is identification of the character
code associated with a colormap "color".  The regex library's colormap
data structure is only designed to provide forward lookup from char
codes to colors; if you want to go the other way, the only possibility
is to serially probe each char value, which is untenable for non-Western
alphabets.  What I did about this was to tweak the colormap building
code to remember just the first char code entered for each color.  If,
after the dust settles, that's the only char belonging to the color,
we can use it --- otherwise, we just punt and stop extending the prefix
string.  Given that we only care about doing reverse mapping for
singleton colors anyway, I believe that this is adequate.  There are
cases where a color might have only one member but it isn't the first
one added, for example in "^abc[de]d".  Here, d and e will be added to
a color for the bracket expression, and afterwards d will be split out
to its own subcolor, leaving e as the sole member of a color it wasn't
the first member of.  But for our purposes it doesn't matter, because
both the d and e colors will be outarcs from the state after c, so
we couldn't extend the fixed prefix to include e anyway.  There might
be some obscure corner cases where this implementation fails to
recognize a usable fixed-prefix character, but they have to be pretty
darn obscure.

Another loose end is that the API for regex_fixed_prefix() supposes
that it can return not only the fixed prefix string extracted from
the pattern, but also the "rest" of the pattern.  There's no way to
reconstitute a "remaining pattern" in what I've added to backend/regex,
so in this patch regex_fixed_prefix() is just passing back the whole
pattern in the Pattern_Prefix_Partial case, which is likely to lead
to a lowball selectivity estimate when there's a long prefix.

Now the previous implementation of that is pretty darn brain-dead
anyway, because what it hands back is whatever's left when it stops
scanning.  For instance, given "^(ab[de])xyz" it would extract the fixed
prefix "ab" and then return "rest" as "[de])xyz", which isn't a valid
regex.  The only thing we are doing with that is passing it to
regex_selectivity(), which is too stupid to have a problem with it; but
obviously there is no chance of ever upgrading the logic without fixing
that somehow.  (Note that right now we will never reach any of this code
anyway unless the target column has a small histogram; we realized long
ago that it was so bogus that relying on histogram_selectivity is a much
better idea if there's a reasonable amount of data...)

It's interesting to think about building something that would examine
the NFA representation of the regex, starting from whatever state
findprefix() stops at, and apply heuristic selectivity calculations
similar to what regex_selectivity() does.  However I don't see where
to put such code while maintaining a reasonable API wall between PG's
selectivity estimators and the regex library.  It's surely not something
that could be considered a general-purpose addition to a standalone
regex library.

What I have in mind to do for the moment is to refactor
regex_fixed_prefix() so that it doesn't hand back a "rest of pattern"
per se, but is directly responsible for handing back a selectivity
estimate for the rest of the pattern (a change we'd need to make anyway
if we ever did what's contemplated in the previous para).  What it can
do is run regex_selectivity() over the whole pattern, and then divide
out FIXED_CHAR_SEL times the length of the fixed prefix, which should
compensate reasonably well for letting regex_selectivity() see the
prefix portion of the pattern.

One other point is that although this change adds regexp compilation
to the planner code path, that should be nearly free in common cases,
because we're caching the compiled regexp, thus saving having to compile
it at query runtime.  I haven't tried to measure but I think the total
runtime should be pretty similar to what it was before.

Comments?

            regards, tom lane

diff --git a/src/backend/regex/Makefile b/src/backend/regex/Makefile
index 21e7fa5329b9384333d6a8c9912be81dd24a4867..74a4c0c89d8efedcb8699dffac465def523431a9 100644
*** a/src/backend/regex/Makefile
--- b/src/backend/regex/Makefile
*************** subdir = src/backend/regex
*** 12,18 ****
  top_builddir = ../../..
  include $(top_builddir)/src/Makefile.global

! OBJS = regcomp.o regerror.o regexec.o regfree.o

  include $(top_srcdir)/src/backend/common.mk

--- 12,18 ----
  top_builddir = ../../..
  include $(top_builddir)/src/Makefile.global

! OBJS = regcomp.o regerror.o regexec.o regfree.o regprefix.o

  include $(top_srcdir)/src/backend/common.mk

diff --git a/src/backend/regex/README b/src/backend/regex/README
index 89ba6a62ea2f70bfda729f03efedba4b9b6fce9b..c5d21e8c99da1c512379c1702bb7e422355ba9d9 100644
*** a/src/backend/regex/README
--- b/src/backend/regex/README
*************** So this file is an attempt to reverse-en
*** 7,18 ****
  General source-file layout
  --------------------------

! There are four separately-compilable source files, each exposing exactly
  one exported function:
      regcomp.c: pg_regcomp
      regexec.c: pg_regexec
      regerror.c: pg_regerror
      regfree.c: pg_regfree
  (The pg_ prefixes were added by the Postgres project to distinguish this
  library version from any similar one that might be present on a particular
  system.  They'd need to be removed or replaced in any standalone version
--- 7,19 ----
  General source-file layout
  --------------------------

! There are five separately-compilable source files, each exposing exactly
  one exported function:
      regcomp.c: pg_regcomp
      regexec.c: pg_regexec
      regerror.c: pg_regerror
      regfree.c: pg_regfree
+     regprefix.c: pg_regprefix
  (The pg_ prefixes were added by the Postgres project to distinguish this
  library version from any similar one that might be present on a particular
  system.  They'd need to be removed or replaced in any standalone version
*************** regexec.c        Top-level regex execution cod
*** 44,49 ****
--- 45,51 ----
  rege_dfa.c        DFA creation and execution
  regerror.c        pg_regerror: generate text for a regex error code
  regfree.c        pg_regfree: API to free a no-longer-needed regex_t
+ regprefix.c        Code for extracting a common prefix from a regex_t

  The locale-specific code is concerned primarily with case-folding and with
  expanding locale-specific character classes, such as [[:alnum:]].  It
diff --git a/src/backend/regex/regc_color.c b/src/backend/regex/regc_color.c
index 2aeb861d9762f0d9f5ec667007ce1b00d22fccde..1c60566fbf57458a1f43a4116432e0636ebcd92d 100644
*** a/src/backend/regex/regc_color.c
--- b/src/backend/regex/regc_color.c
*************** initcm(struct vars * v,
*** 66,73 ****
      cd = cm->cd;                /* cm->cd[WHITE] */
      cd->sub = NOSUB;
      cd->arcs = NULL;
!     cd->flags = 0;
      cd->nchrs = CHR_MAX - CHR_MIN + 1;

      /* upper levels of tree */
      for (t = &cm->tree[0], j = NBYTS - 1; j > 0; t = nextt, j--)
--- 66,74 ----
      cd = cm->cd;                /* cm->cd[WHITE] */
      cd->sub = NOSUB;
      cd->arcs = NULL;
!     cd->firstchr = CHR_MIN;
      cd->nchrs = CHR_MAX - CHR_MIN + 1;
+     cd->flags = 0;

      /* upper levels of tree */
      for (t = &cm->tree[0], j = NBYTS - 1; j > 0; t = nextt, j--)
*************** newcolor(struct colormap * cm)
*** 272,277 ****
--- 273,279 ----
      cd->nchrs = 0;
      cd->sub = NOSUB;
      cd->arcs = NULL;
+     cd->firstchr = CHR_MIN;        /* in case never set otherwise */
      cd->flags = 0;
      cd->block = NULL;

*************** subcolor(struct colormap * cm, chr c)
*** 371,376 ****
--- 373,380 ----
      if (co == sco)                /* already in an open subcolor */
          return co;                /* rest is redundant */
      cm->cd[co].nchrs--;
+     if (cm->cd[sco].nchrs == 0)
+         cm->cd[sco].firstchr = c;
      cm->cd[sco].nchrs++;
      setcolor(cm, c, sco);
      return sco;
*************** subrange(struct vars * v,
*** 438,443 ****
--- 442,452 ----

  /*
   * subblock - allocate new subcolors for one tree block of chrs, fill in arcs
+  *
+  * Note: subcolors that are created during execution of this function
+  * will not be given a useful value of firstchr; it'll be left as CHR_MIN.
+  * For the current usage of firstchr in pg_regprefix, this does not matter
+  * because such subcolors won't occur in the common prefix of a regex.
   */
  static void
  subblock(struct vars * v,
diff --git a/src/backend/regex/regprefix.c b/src/backend/regex/regprefix.c
index ...3b4124a71acf966a16db6f51f8e345ac2e703dcd .
*** a/src/backend/regex/regprefix.c
--- b/src/backend/regex/regprefix.c
***************
*** 0 ****
--- 1,259 ----
+ /*-------------------------------------------------------------------------
+  *
+  * regprefix.c
+  *      Extract a common prefix, if any, from a compiled regex.
+  *
+  *
+  * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1998, 1999 Henry Spencer
+  *
+  * IDENTIFICATION
+  *      src/backend/regex/regprefix.c
+  *
+  *-------------------------------------------------------------------------
+  */
+
+ #include "regex/regguts.h"
+
+
+ /*
+  * forward declarations
+  */
+ static int findprefix(struct cnfa * cnfa, struct colormap * cm,
+                       chr *string, size_t *slength);
+
+
+ /*
+  * pg_regprefix - get common prefix for regular expression
+  *
+  * Returns one of:
+  *    REG_NOMATCH: there is no common prefix of strings matching the regex
+  *    REG_PREFIX: there is a common prefix of strings matching the regex
+  *    REG_EXACT: all strings satisfying the regex must match the same string
+  *    or a REG_XXX error code
+  *
+  * In the non-failure cases, *string is set to a malloc'd string containing
+  * the common prefix or exact value, of length *slength (measured in chrs
+  * not bytes!).
+  *
+  * This function does not analyze all complex cases (such as lookahead
+  * constraints) exactly.  Therefore it is possible that some strings matching
+  * the reported prefix or exact-match string do not satisfy the regex.  But
+  * it should never be the case that a string satisfying the regex does not
+  * match the reported prefix or exact-match string.
+  */
+ int
+ pg_regprefix(regex_t *re,
+              chr **string,
+              size_t *slength)
+ {
+     struct guts *g;
+     struct cnfa *cnfa;
+     int            st;
+
+     /* sanity checks */
+     if (string == NULL || slength == NULL)
+         return REG_INVARG;
+     *string = NULL;                /* initialize for failure cases */
+     *slength = 0;
+     if (re == NULL || re->re_magic != REMAGIC)
+         return REG_INVARG;
+     if (re->re_csize != sizeof(chr))
+         return REG_MIXED;
+
+     /* Initialize locale-dependent support */
+     pg_set_regex_collation(re->re_collation);
+
+     /* setup */
+     g = (struct guts *) re->re_guts;
+     if (g->info & REG_UIMPOSSIBLE)
+         return REG_NOMATCH;
+
+     /*
+      * This implementation considers only the search NFA for the topmost regex
+      * tree node.  Therefore, constraints such as backrefs are not fully
+      * applied, which is allowed per the function's API spec.
+      */
+     assert(g->tree != NULL);
+     cnfa = &g->tree->cnfa;
+
+     /*
+      * Since a correct NFA should never contain any exit-free loops, it should
+      * not be possible for our traversal to return to a previously visited
+      * NFA state.  Hence we need at most nstates chrs in the output string.
+      */
+     *string = (chr *) MALLOC(cnfa->nstates * sizeof(chr));
+     if (*string == NULL)
+         return REG_ESPACE;
+
+     /* do it */
+     st = findprefix(cnfa, &g->cmap, *string, slength);
+
+     assert(*slength <= cnfa->nstates);
+
+     /* clean up */
+     if (st != REG_PREFIX && st != REG_EXACT)
+     {
+         FREE(*string);
+         *string = NULL;
+         *slength = 0;
+     }
+
+     return st;
+ }
+
+ /*
+  * findprefix - extract common prefix from cNFA
+  *
+  * Results are returned into the preallocated chr array string[], with
+  * *slength (which must be preset to zero) incremented for each chr.
+  */
+ static int                        /* regprefix return code */
+ findprefix(struct cnfa * cnfa,
+            struct colormap * cm,
+            chr *string,
+            size_t *slength)
+ {
+     int            st;
+     int            nextst;
+     color        thiscolor;
+     chr            c;
+     struct carc *ca;
+
+     /*
+      * The "pre" state must have only BOS/BOL outarcs, else pattern isn't
+      * anchored left.  If we have both BOS and BOL, they must go to the
+      * same next state.
+      */
+     st = cnfa->pre;
+     nextst = -1;
+     for (ca = cnfa->states[st]; ca->co != COLORLESS; ca++)
+     {
+         if (ca->co == cnfa->bos[0] || ca->co == cnfa->bos[1])
+         {
+             if (nextst == -1)
+                 nextst = ca->to;
+             else if (nextst != ca->to)
+                 return REG_NOMATCH;
+         }
+         else
+             return REG_NOMATCH;
+     }
+     if (nextst == -1)
+         return REG_NOMATCH;
+
+     /*
+      * Scan through successive states, stopping as soon as we find one with
+      * more than one acceptable transition character (either multiple colors
+      * on out-arcs, or a color with more than one member chr).
+      *
+      * We could find a state with multiple out-arcs that are all labeled with
+      * the same singleton color; this comes from patterns like "^ab(cde|cxy)".
+      * In that case we add the chr to the output string but then exit the loop
+      * with nextst == -1.  This leaves a little bit on the table: if the
+      * pattern is like "^ab(cde|cdy)", we won't notice that "d" could be added
+      * to the prefix.  But chasing multiple parallel state chains doesn't seem
+      * worth the trouble.
+      */
+     do
+     {
+         st = nextst;
+         nextst = -1;
+         thiscolor = COLORLESS;
+         for (ca = cnfa->states[st]; ca->co != COLORLESS; ca++)
+         {
+             /* We ignore lookahead constraints */
+             if (ca->co >= cnfa->ncolors)
+                 continue;
+             /* We can also ignore BOS/BOL arcs */
+             if (ca->co == cnfa->bos[0] || ca->co == cnfa->bos[1])
+                 continue;
+             /* ... but EOS/EOL arcs terminate the search */
+             if (ca->co == cnfa->eos[0] || ca->co == cnfa->eos[1])
+             {
+                 thiscolor = COLORLESS;
+                 break;
+             }
+             if (thiscolor == COLORLESS)
+             {
+                 /* First plain outarc */
+                 thiscolor = ca->co;
+                 nextst = ca->to;
+             }
+             else if (thiscolor == ca->co)
+             {
+                 /* Another plain outarc for same color */
+                 nextst = -1;
+             }
+             else
+             {
+                 /* More than one plain outarc color terminates the search */
+                 thiscolor = COLORLESS;
+                 break;
+             }
+         }
+         /* Done if we didn't find exactly one color on plain outarcs */
+         if (thiscolor == COLORLESS)
+             break;
+         /* The color must be a singleton */
+         if (cm->cd[thiscolor].nchrs != 1)
+             break;
+
+         /*
+          * Identify the color's sole member chr and add it to the prefix
+          * string.  In general the colormap data structure doesn't provide a
+          * way to find color member chrs, except by trying GETCOLOR() on each
+          * possible chr value, which won't do at all.  However, for the cases
+          * we care about it should be sufficient to test the "firstchr" value,
+          * that is the first chr ever added to the color.  There are cases
+          * where this might no longer be a member of the color (so we do need
+          * to test), but none of them are likely to arise for a character that
+          * is a member of a common prefix.  If we do hit such a corner case,
+          * we just fall out without adding anything to the prefix string.
+          */
+         c = cm->cd[thiscolor].firstchr;
+         if (GETCOLOR(cm, c) != thiscolor)
+             break;
+
+         string[(*slength)++] = c;
+
+         /* Advance to next state, but only if we have a unique next state */
+     } while (nextst != -1);
+
+     /*
+      * If we ended at a state that only has EOS/EOL outarcs leading to the
+      * "post" state, then we have an exact-match string.  Note this is true
+      * even if the string is of zero length.
+      */
+     nextst = -1;
+     for (ca = cnfa->states[st]; ca->co != COLORLESS; ca++)
+     {
+         if (ca->co == cnfa->eos[0] || ca->co == cnfa->eos[1])
+         {
+             if (nextst == -1)
+                 nextst = ca->to;
+             else if (nextst != ca->to)
+             {
+                 nextst = -1;
+                 break;
+             }
+         }
+         else
+         {
+             nextst = -1;
+             break;
+         }
+     }
+     if (nextst == cnfa->post)
+         return REG_EXACT;
+
+     /*
+      * Otherwise, if we were unable to identify any prefix characters, say
+      * NOMATCH --- the pattern is anchored left, but doesn't specify any
+      * particular first character.
+      */
+     if (*slength > 0)
+         return REG_PREFIX;
+
+     return REG_NOMATCH;
+ }
diff --git a/src/backend/utils/adt/regexp.c b/src/backend/utils/adt/regexp.c
index 96c77078c8b51f434ea9ddb9947abb5e93143410..074142e7985499f152e07ab0c4e1c1d0675c4042 100644
*** a/src/backend/utils/adt/regexp.c
--- b/src/backend/utils/adt/regexp.c
*************** build_regexp_split_result(regexp_matches
*** 1170,1172 ****
--- 1170,1237 ----
                                     Int32GetDatum(startpos + 1));
      }
  }
+
+ /*
+  * regexp_fixed_prefix - extract fixed prefix, if any, for a regexp
+  *
+  * The result is NULL if there is no fixed prefix, else a palloc'd string.
+  * If it is an exact match, not just a prefix, *exact is returned as TRUE.
+  */
+ char *
+ regexp_fixed_prefix(text *text_re, bool case_insensitive, Oid collation,
+                     bool *exact)
+ {
+     char       *result;
+     regex_t    *re;
+     int            cflags;
+     int            re_result;
+     pg_wchar   *str;
+     size_t        slen;
+     size_t        maxlen;
+     char        errMsg[100];
+
+     *exact = false;                /* default result */
+
+     /* Compile RE */
+     cflags = REG_ADVANCED;
+     if (case_insensitive)
+         cflags |= REG_ICASE;
+
+     re = RE_compile_and_cache(text_re, cflags, collation);
+
+     /* Examine it to see if there's a fixed prefix */
+     re_result = pg_regprefix(re, &str, &slen);
+
+     switch (re_result)
+     {
+         case REG_NOMATCH:
+             return NULL;
+
+         case REG_PREFIX:
+             /* continue with wchar conversion */
+             break;
+
+         case REG_EXACT:
+             *exact = true;
+             /* continue with wchar conversion */
+             break;
+
+         default:
+             /* re failed??? */
+             pg_regerror(re_result, re, errMsg, sizeof(errMsg));
+             ereport(ERROR,
+                     (errcode(ERRCODE_INVALID_REGULAR_EXPRESSION),
+                      errmsg("regular expression failed: %s", errMsg)));
+             break;
+     }
+
+     /* Convert pg_wchar result back to database encoding */
+     maxlen = pg_database_encoding_max_length() * slen + 1;
+     result = (char *) palloc(maxlen);
+     slen = pg_wchar2mb_with_len(str, result, slen);
+     Assert(slen < maxlen);
+
+     free(str);
+
+     return result;
+ }
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 95e46276f0a8911758f4ec02b993193bf55eee15..b364bfa39d9c218fb5c96527b056777e175b935c 100644
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*************** static Pattern_Prefix_Status
*** 5236,5253 ****
  regex_fixed_prefix(Const *patt_const, bool case_insensitive, Oid collation,
                     Const **prefix_const, Const **rest_const)
  {
-     char       *match;
-     int            pos,
-                 match_pos,
-                 prev_pos,
-                 prev_match_pos;
-     bool        have_leading_paren;
-     char       *patt;
-     char       *rest;
      Oid            typeid = patt_const->consttype;
!     bool        is_multibyte = (pg_database_encoding_max_length() > 1);
!     pg_locale_t locale = 0;
!     bool        locale_is_c = false;

      /*
       * Should be unnecessary, there are no bytea regex operators defined. As
--- 5236,5244 ----
  regex_fixed_prefix(Const *patt_const, bool case_insensitive, Oid collation,
                     Const **prefix_const, Const **rest_const)
  {
      Oid            typeid = patt_const->consttype;
!     char       *prefix;
!     bool        exact;

      /*
       * Should be unnecessary, there are no bytea regex operators defined. As
*************** regex_fixed_prefix(Const *patt_const, bo
*** 5259,5438 ****
                  (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
           errmsg("regular-expression matching not supported on type bytea")));

!     if (case_insensitive)
!     {
!         /* If case-insensitive, we need locale info */
!         if (lc_ctype_is_c(collation))
!             locale_is_c = true;
!         else if (collation != DEFAULT_COLLATION_OID)
!         {
!             if (!OidIsValid(collation))
!             {
!                 /*
!                  * This typically means that the parser could not resolve a
!                  * conflict of implicit collations, so report it that way.
!                  */
!                 ereport(ERROR,
!                         (errcode(ERRCODE_INDETERMINATE_COLLATION),
!                          errmsg("could not determine which collation to use for regular expression"),
!                          errhint("Use the COLLATE clause to set the collation explicitly.")));
!             }
!             locale = pg_newlocale_from_collation(collation);
!         }
!     }
!
!     /* the right-hand const is type text for all of these */
!     patt = TextDatumGetCString(patt_const->constvalue);
!
!     /*
!      * Check for ARE director prefix.  It's worth our trouble to recognize
!      * this because similar_escape() used to use it, and some other code might
!      * still use it, to force ARE mode.
!      */
!     pos = 0;
!     if (strncmp(patt, "***:", 4) == 0)
!         pos = 4;
!
!     /* Pattern must be anchored left */
!     if (patt[pos] != '^')
!     {
!         rest = patt;
!
!         *prefix_const = NULL;
!         *rest_const = string_to_const(rest, typeid);
!
!         return Pattern_Prefix_None;
!     }
!     pos++;

!     /*
!      * If '|' is present in pattern, then there may be multiple alternatives
!      * for the start of the string.  (There are cases where this isn't so, for
!      * instance if the '|' is inside parens, but detecting that reliably is
!      * too hard.)
!      */
!     if (strchr(patt + pos, '|') != NULL)
      {
-         rest = patt;
-
          *prefix_const = NULL;
!         *rest_const = string_to_const(rest, typeid);

          return Pattern_Prefix_None;
      }

!     /* OK, allocate space for pattern */
!     match = palloc(strlen(patt) + 1);
!     prev_match_pos = match_pos = 0;
!
!     /*
!      * We special-case the syntax '^(...)$' because psql uses it.  But beware:
!      * sequences beginning "(?" are not what they seem, unless they're "(?:".
!      * (We must recognize that because of similar_escape().)
!      */
!     have_leading_paren = false;
!     if (patt[pos] == '(' &&
!         (patt[pos + 1] != '?' || patt[pos + 2] == ':'))
!     {
!         have_leading_paren = true;
!         pos += (patt[pos + 1] != '?' ? 1 : 3);
!     }
!
!     /* Scan remainder of pattern */
!     prev_pos = pos;
!     while (patt[pos])
!     {
!         int            len;
!
!         /*
!          * Check for characters that indicate multiple possible matches here.
!          * Also, drop out at ')' or '$' so the termination test works right.
!          */
!         if (patt[pos] == '.' ||
!             patt[pos] == '(' ||
!             patt[pos] == ')' ||
!             patt[pos] == '[' ||
!             patt[pos] == '^' ||
!             patt[pos] == '$')
!             break;
!
!         /* Stop if case-varying character (it's sort of a wildcard) */
!         if (case_insensitive &&
!           pattern_char_isalpha(patt[pos], is_multibyte, locale, locale_is_c))
!             break;
!
!         /*
!          * Check for quantifiers.  Except for +, this means the preceding
!          * character is optional, so we must remove it from the prefix too!
!          */
!         if (patt[pos] == '*' ||
!             patt[pos] == '?' ||
!             patt[pos] == '{')
!         {
!             match_pos = prev_match_pos;
!             pos = prev_pos;
!             break;
!         }
!         if (patt[pos] == '+')
!         {
!             pos = prev_pos;
!             break;
!         }
!
!         /*
!          * Normally, backslash quotes the next character.  But in AREs,
!          * backslash followed by alphanumeric is an escape, not a quoted
!          * character.  Must treat it as having multiple possible matches.
!          * Note: since only ASCII alphanumerics are escapes, we don't have to
!          * be paranoid about multibyte or collations here.
!          */
!         if (patt[pos] == '\\')
!         {
!             if (isalnum((unsigned char) patt[pos + 1]))
!                 break;
!             pos++;
!             if (patt[pos] == '\0')
!                 break;
!         }
!         /* save position in case we need to back up on next loop cycle */
!         prev_match_pos = match_pos;
!         prev_pos = pos;
!         /* must use encoding-aware processing here */
!         len = pg_mblen(&patt[pos]);
!         memcpy(&match[match_pos], &patt[pos], len);
!         match_pos += len;
!         pos += len;
!     }
!
!     match[match_pos] = '\0';
!     rest = &patt[pos];
!
!     if (have_leading_paren && patt[pos] == ')')
!         pos++;
!
!     if (patt[pos] == '$' && patt[pos + 1] == '\0')
!     {
!         rest = &patt[pos + 1];

!         *prefix_const = string_to_const(match, typeid);
!         *rest_const = string_to_const(rest, typeid);

!         pfree(patt);
!         pfree(match);

          return Pattern_Prefix_Exact;    /* pattern specifies exact match */
!     }
!
!     *prefix_const = string_to_const(match, typeid);
!     *rest_const = string_to_const(rest, typeid);
!
!     pfree(patt);
!     pfree(match);
!
!     if (match_pos > 0)
          return Pattern_Prefix_Partial;
-
-     return Pattern_Prefix_None;
  }

  Pattern_Prefix_Status
--- 5250,5281 ----
                  (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
           errmsg("regular-expression matching not supported on type bytea")));

!     /* Use the regexp machinery to extract the prefix, if any */
!     prefix = regexp_fixed_prefix(DatumGetTextPP(patt_const->constvalue),
!                                  case_insensitive, collation,
!                                  &exact);

!     if (prefix == NULL)
      {
          *prefix_const = NULL;
!         *rest_const = patt_const;

          return Pattern_Prefix_None;
      }

!     *prefix_const = string_to_const(prefix, typeid);

!     if (exact)
!         *rest_const = string_to_const("", typeid);
!     else
!         *rest_const = patt_const;        /* bogus */

!     pfree(prefix);

+     if (exact)
          return Pattern_Prefix_Exact;    /* pattern specifies exact match */
!     else
          return Pattern_Prefix_Partial;
  }

  Pattern_Prefix_Status
diff --git a/src/include/regex/regex.h b/src/include/regex/regex.h
index cec4b837cd15665da8e800ac1729245c21f1b22d..616c2c6450d80443361ca7e4660f52aaa266c9ad 100644
*** a/src/include/regex/regex.h
--- b/src/include/regex/regex.h
*************** typedef struct
*** 156,161 ****
--- 156,164 ----
  /* two specials for debugging and testing */
  #define REG_ATOI    101            /* convert error-code name to number */
  #define REG_ITOA    102            /* convert error-code number to name */
+ /* non-error result codes for pg_regprefix */
+ #define REG_PREFIX    (-1)        /* identified a common prefix */
+ #define REG_EXACT    (-2)        /* identified an exact match */



*************** typedef struct
*** 164,169 ****
--- 167,173 ----
   */
  extern int    pg_regcomp(regex_t *, const pg_wchar *, size_t, int, Oid);
  extern int    pg_regexec(regex_t *, const pg_wchar *, size_t, size_t, rm_detail_t *, size_t, regmatch_t[], int);
+ extern int    pg_regprefix(regex_t *, pg_wchar **, size_t *);
  extern void pg_regfree(regex_t *);
  extern size_t pg_regerror(int, const regex_t *, char *, size_t);
  extern void pg_set_regex_collation(Oid collation);
diff --git a/src/include/regex/regguts.h b/src/include/regex/regguts.h
index b8788506d417e83b3a293efae391ef63f8468e5b..e1e406f4eaa77720d88938cf150348ce903f88d8 100644
*** a/src/include/regex/regguts.h
--- b/src/include/regex/regguts.h
*************** struct colordesc
*** 199,217 ****
      color        sub;            /* open subcolor, if any; or free-chain ptr */
  #define  NOSUB     COLORLESS        /* value of "sub" when no open subcolor */
      struct arc *arcs;            /* chain of all arcs of this color */
      int            flags;            /* bit values defined next */
  #define  FREECOL 01                /* currently free */
  #define  PSEUDO  02                /* pseudocolor, no real chars */
! #define  UNUSEDCOLOR(cd) ((cd)->flags&FREECOL)
      union tree *block;            /* block of solid color, if any */
  };

  /*
   * The color map itself
   *
!  * Only the "tree" part is used at execution time, and that only via the
!  * GETCOLOR() macro.  Possibly that should be separated from the compile-time
!  * data.
   */
  struct colormap
  {
--- 199,219 ----
      color        sub;            /* open subcolor, if any; or free-chain ptr */
  #define  NOSUB     COLORLESS        /* value of "sub" when no open subcolor */
      struct arc *arcs;            /* chain of all arcs of this color */
+     chr            firstchr;        /* char first assigned to this color */
      int            flags;            /* bit values defined next */
  #define  FREECOL 01                /* currently free */
  #define  PSEUDO  02                /* pseudocolor, no real chars */
! #define  UNUSEDCOLOR(cd) ((cd)->flags & FREECOL)
      union tree *block;            /* block of solid color, if any */
  };

  /*
   * The color map itself
   *
!  * Much of the data in the colormap struct is only used at compile time.
!  * However, the bulk of the space usage is in the "tree" structure, so it's
!  * not clear that there's much point in converting the rest to a more compact
!  * form when compilation is finished.
   */
  struct colormap
  {
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 10634039a440e329d67a40e17dda944a4799911b..b43a4f1f6f39b389d70c2a09a6a230ea1aa598dd 100644
*** a/src/include/utils/builtins.h
--- b/src/include/utils/builtins.h
*************** extern Datum regexp_split_to_table(PG_FU
*** 568,573 ****
--- 568,575 ----
  extern Datum regexp_split_to_table_no_flags(PG_FUNCTION_ARGS);
  extern Datum regexp_split_to_array(PG_FUNCTION_ARGS);
  extern Datum regexp_split_to_array_no_flags(PG_FUNCTION_ARGS);
+ extern char *regexp_fixed_prefix(text *text_re, bool case_insensitive,
+                                  Oid collation, bool *exact);

  /* regproc.c */
  extern Datum regprocin(PG_FUNCTION_ARGS);
pgsql-hackers by date:
From: Peter Eisentraut
Date: 08 July 2012, 16:31:52
Subject: Re: Schema version management
From: Tom Lane
Date: 08 July 2012, 19:52:52
Subject: Re: Schema version management
Re: regex_fixed_prefix() is still a few bricks shy of a load - Mailing list pgsql-hackers

Previous

Next