Thread: Notes about fixing regexes and UTF-8 (yet again)
In bug #6457 it's pointed out that we *still* don't have full functionality for locale-dependent regexp behavior with UTF8 encoding. The reason is that there's old crufty code in regc_locale.c that only considers character codes up to 255 when searching for characters that should be considered "letters", "digits", etc. We could fix that, for some value of "fix", by iterating up to perhaps 0xFFFF when dealing with UTF8 encoding, but the time that would take is unappealing. Especially so considering that this code is executed afresh anytime we compile a regex that requires locale knowledge. I looked into the upstream Tcl code and observed that they deal with this by having hard-wired tables of which Unicode code points are to be considered letters etc. The tables are directly traceable to the Unicode standard (they provide a script to regenerate them from files available from unicode.org). Nonetheless, I do not find that approach appealing, mainly because we'd be risking deviating from the libc locale code's behavior within regexes when we follow it everywhere else. It seems entirely likely to me that a particular locale setting might consider only some of what Unicode says are letters to be letters. However, we could possibly compromise by using Unicode-derived tables as a guide to which code points are worth probing libc for. That is, assume that a utf8-based locale will never claim that some code is a letter that unicode.org doesn't think is a letter. That would cut the number of required probes by a pretty large factor. The other thing that seems worth doing is to install some caching. We could presumably assume that the behavior of iswupper() et al are fixed for the duration of a database session, so that we only need to run the probe loop once when first asked to create a cvec for a particular category. Thoughts, better ideas? regards, tom lane
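For concreteness, here is a minimal standalone C sketch of the "probe libc once per session and cache the result" idea. The names (cached_class, get_class_bitmap, MAX_PROBE_CHR) are invented for illustration and do not correspond to anything in regc_locale.c; error handling is omitted.

#include <stdbool.h>
#include <stdlib.h>
#include <wctype.h>

#define MAX_PROBE_CHR 0xFFFF    /* probe the BMP only, per the discussion */

typedef struct cached_class
{
    int         (*probefunc) (wint_t);  /* e.g. iswalpha */
    bool       *is_member;              /* one flag per probed code point */
    struct cached_class *next;
} cached_class;

static cached_class *class_cache = NULL;

/*
 * Return a membership table for the given classifier, probing libc only the
 * first time that classifier is requested in a session.
 */
static const bool *
get_class_bitmap(int (*probefunc) (wint_t))
{
    cached_class *cc;
    wint_t      c;

    for (cc = class_cache; cc != NULL; cc = cc->next)
    {
        if (cc->probefunc == probefunc)
            return cc->is_member;       /* cache hit: no libc calls at all */
    }

    cc = malloc(sizeof(cached_class));
    cc->probefunc = probefunc;
    cc->is_member = malloc((MAX_PROBE_CHR + 1) * sizeof(bool));
    for (c = 0; c <= MAX_PROBE_CHR; c++)
        cc->is_member[c] = (probefunc(c) != 0);     /* the expensive probe loop */
    cc->next = class_cache;
    class_cache = cc;
    return cc->is_member;
}

A caller would fetch the table once, e.g. const bool *alpha = get_class_bitmap(iswalpha);, after which each lookup is a single array index rather than a libc call.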
On 16.02.2012 01:06, Tom Lane wrote: > In bug #6457 it's pointed out that we *still* don't have full > functionality for locale-dependent regexp behavior with UTF8 encoding. > The reason is that there's old crufty code in regc_locale.c that only > considers character codes up to 255 when searching for characters that > should be considered "letters", "digits", etc. We could fix that, for > some value of "fix", by iterating up to perhaps 0xFFFF when dealing with > UTF8 encoding, but the time that would take is unappealing. Especially > so considering that this code is executed afresh anytime we compile a > regex that requires locale knowledge. > > I looked into the upstream Tcl code and observed that they deal with > this by having hard-wired tables of which Unicode code points are to be > considered letters etc. The tables are directly traceable to the > Unicode standard (they provide a script to regenerate them from files > available from unicode.org). Nonetheless, I do not find that approach > appealing, mainly because we'd be risking deviating from the libc locale > code's behavior within regexes when we follow it everywhere else. > It seems entirely likely to me that a particular locale setting might > consider only some of what Unicode says are letters to be letters. > > However, we could possibly compromise by using Unicode-derived tables > as a guide to which code points are worth probing libc for. That is, > assume that a utf8-based locale will never claim that some code is a > letter that unicode.org doesn't think is a letter. That would cut the > number of required probes by a pretty large factor. > > The other thing that seems worth doing is to install some caching. > We could presumably assume that the behavior of iswupper() et al are > fixed for the duration of a database session, so that we only need to > run the probe loop once when first asked to create a cvec for a > particular category. > > Thoughts, better ideas? Here's a wild idea: keep the class of each codepoint in a hash table. Initialize it with all codepoints up to 0xFFFF. After that, whenever a string contains a character that's not in the hash table yet, query the class of that character, and add it to the hash table. Then recompile the whole regex and restart the matching engine. Recompiling is expensive, but if you cache the results for the session, it would probably be acceptable. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Here's a wild idea: keep the class of each codepoint in a hash table. > Initialize it with all codepoints up to 0xFFFF. After that, whenever a > string contains a character that's not in the hash table yet, query the > class of that character, and add it to the hash table. Then recompile > the whole regex and restart the matching engine. > Recompiling is expensive, but if you cache the results for the session, > it would probably be acceptable. Dunno ... recompiling is so expensive that I can't see this being a win; not to mention that it would require fundamental surgery on the regex code. In the Tcl implementation, no codepoints above U+FFFF have any locale properties (alpha/digit/punct/etc), period. Personally I'd not have a problem imposing the same limitation, so that dealing with stuff above that range isn't really a consideration anyway. regards, tom lane
On 02/17/2012 09:39 AM, Tom Lane wrote: > Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes: >> Here's a wild idea: keep the class of each codepoint in a hash table. >> Initialize it with all codepoints up to 0xFFFF. After that, whenever a >> string contains a character that's not in the hash table yet, query the >> class of that character, and add it to the hash table. Then recompile >> the whole regex and restart the matching engine. >> Recompiling is expensive, but if you cache the results for the session, >> it would probably be acceptable. > Dunno ... recompiling is so expensive that I can't see this being a win; > not to mention that it would require fundamental surgery on the regex > code. > > In the Tcl implementation, no codepoints above U+FFFF have any locale > properties (alpha/digit/punct/etc), period. Personally I'd not have a > problem imposing the same limitation, so that dealing with stuff above > that range isn't really a consideration anyway. up to U+FFFF is the BMP which is described as containing "characters for almost all modern languages, and a large number of special characters." It seems very likely to be acceptable not to bother about the locale of code points in the supplementary planes. See <http://en.wikipedia.org/wiki/Plane_%28Unicode%29> for descriptions of which sets of characters are involved. cheers andrew
On Fri, Feb 17, 2012 at 3:48 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Here's a wild idea: keep the class of each codepoint in a hash table. > Initialize it with all codepoints up to 0xFFFF. After that, whenever a > string contains a character that's not in the hash table yet, query the > class of that character, and add it to the hash table. Then recompile the > whole regex and restart the matching engine. > > Recompiling is expensive, but if you cache the results for the session, it > would probably be acceptable. What if you did this ONCE and wrote the results to a file someplace? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Feb 17, 2012 at 3:48 AM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> Recompiling is expensive, but if you cache the results for the session, it >> would probably be acceptable. > What if you did this ONCE and wrote the results to a file someplace? That's still a cache, you've just defaulted on your obligation to think about what conditions require the cache to be flushed. (In the case at hand, the trigger for a cache rebuild would probably need to be a glibc package update, which we have no way of knowing about.) Before going much further with this, we should probably do some timings of 64K calls of iswupper and friends, just to see how bad a dumb implementation will be. regards, tom lane
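As a quick way to get such numbers, a throwaway harness along these lines could be used. It assumes a glibc-style platform where wint_t values in a UTF-8 locale are Unicode code points; the locale name is just an example, and older glibc may need -lrt for clock_gettime:

#include <locale.h>
#include <stdio.h>
#include <time.h>
#include <wctype.h>

int
main(void)
{
    struct timespec start,
                end;
    long        nalpha = 0;
    wint_t      c;

    if (setlocale(LC_ALL, "en_US.utf8") == NULL)    /* example locale */
    {
        fprintf(stderr, "locale not available\n");
        return 1;
    }

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (c = 0; c <= 0xFFFF; c++)
    {
        if (iswalpha(c))            /* swap in iswupper() etc. as desired */
            nalpha++;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    printf("%ld code points are alpha; %.3f ms for 64K probes\n",
           nalpha,
           (end.tv_sec - start.tv_sec) * 1000.0 +
           (end.tv_nsec - start.tv_nsec) / 1000000.0);
    return 0;
}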
On Fri, Feb 17, 2012 at 10:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> What if you did this ONCE and wrote the results to a file someplace? > > That's still a cache, you've just defaulted on your obligation to think > about what conditions require the cache to be flushed. Yep. Unfortunately, I don't have a good idea how to handle that; I was hoping someone else did. > Before going much further with this, we should probably do some timings > of 64K calls of iswupper and friends, just to see how bad a dumb > implementation will be. Can't hurt. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Feb 17, 2012 at 10:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Before going much further with this, we should probably do some timings >> of 64K calls of iswupper and friends, just to see how bad a dumb >> implementation will be. > Can't hurt. The answer, on a reasonably new desktop machine (2.0GHz Xeon E5503) running Fedora 16 in en_US.utf8 locale, is that 64K iterations of pg_wc_isalpha or sibling functions requires a shade under 2ms. So this definitely justifies caching the values to avoid computing them more than once per session, but I'm not convinced there are grounds for trying harder than that. BTW, I am also a bit surprised to find out that this locale considers 48342 of those characters to satisfy isalpha(). Seems like a heck of a lot. But anyway we can forget my idea of trying to save work by incorporating a-priori assumptions about which Unicode codepoints are which --- it'll be faster to just iterate through them all, at least for that case. Maybe we should hard-wire some cases like digits, not sure. regards, tom lane
I wrote: > The answer, on a reasonably new desktop machine (2.0GHz Xeon E5503) > running Fedora 16 in en_US.utf8 locale, is that 64K iterations of > pg_wc_isalpha or sibling functions requires a shade under 2ms. > So this definitely justifies caching the values to avoid computing > them more than once per session, but I'm not convinced there are > grounds for trying harder than that. And here's a poorly-tested draft patch for that. regards, tom lane diff --git a/src/backend/regex/regc_cvec.c b/src/backend/regex/regc_cvec.c index fb6f06b5243f50bfad2cefa5c016d4e842791a3d..98f3c597678b492dd59afcd956e5cdfecdba4f86 100644 *** a/src/backend/regex/regc_cvec.c --- b/src/backend/regex/regc_cvec.c *************** static void *** 77,82 **** --- 77,83 ---- addchr(struct cvec * cv, /* character vector */ chr c) /* character to add */ { + assert(cv->nchrs < cv->chrspace); cv->chrs[cv->nchrs++] = (chr) c; } diff --git a/src/backend/regex/regc_locale.c b/src/backend/regex/regc_locale.c index 6cf27958b1545a61fba01e76dc4d37aca32789dc..44ce582bdad1a7d830d4122cada45a39c188981c 100644 *** a/src/backend/regex/regc_locale.c --- b/src/backend/regex/regc_locale.c *************** static const struct cname *** 351,356 **** --- 351,366 ---- /* + * We do not use the hard-wired Unicode classification tables that Tcl does. + * This is because (a) we need to deal with other encodings besides Unicode, + * and (b) we want to track the behavior of the libc locale routines as + * closely as possible. For example, it wouldn't be unreasonable for a + * locale to not consider every Unicode letter as a letter. So we build + * character classification cvecs by asking libc, even for Unicode. + */ + + + /* * element - map collating-element name to celt */ static celt *************** cclass(struct vars * v, /* context */ *** 498,503 **** --- 508,514 ---- int cases) /* case-independent? */ { size_t len; + const struct cvec *ccv = NULL; struct cvec *cv = NULL; const char * const *namePtr; int i, *************** cclass(struct vars * v, /* context */ *** 549,626 **** /* * Now compute the character class contents. - * - * For the moment, assume that only char codes < 256 can be in these - * classes. */ switch ((enum classes) index) { case CC_PRINT: ! cv = getcvec(v, UCHAR_MAX, 0); ! if (cv) ! { ! for (i = 0; i <= UCHAR_MAX; i++) ! { ! if (pg_wc_isprint((chr) i)) ! addchr(cv, (chr) i); ! } ! } break; case CC_ALNUM: ! cv = getcvec(v, UCHAR_MAX, 0); ! if (cv) ! { ! for (i = 0; i <= UCHAR_MAX; i++) ! { ! if (pg_wc_isalnum((chr) i)) ! addchr(cv, (chr) i); ! } ! } break; case CC_ALPHA: ! cv = getcvec(v, UCHAR_MAX, 0); ! if (cv) ! { ! for (i = 0; i <= UCHAR_MAX; i++) ! { ! if (pg_wc_isalpha((chr) i)) ! addchr(cv, (chr) i); ! } ! } break; case CC_ASCII: cv = getcvec(v, 0, 1); if (cv) addrange(cv, 0, 0x7f); break; case CC_BLANK: cv = getcvec(v, 2, 0); addchr(cv, '\t'); addchr(cv, ' '); break; case CC_CNTRL: cv = getcvec(v, 0, 2); addrange(cv, 0x0, 0x1f); addrange(cv, 0x7f, 0x9f); break; case CC_DIGIT: ! cv = getcvec(v, 0, 1); ! if (cv) ! addrange(cv, (chr) '0', (chr) '9'); break; case CC_PUNCT: ! cv = getcvec(v, UCHAR_MAX, 0); ! if (cv) ! { ! for (i = 0; i <= UCHAR_MAX; i++) ! { ! if (pg_wc_ispunct((chr) i)) ! addchr(cv, (chr) i); ! } ! } break; case CC_XDIGIT: cv = getcvec(v, 0, 3); if (cv) { --- 560,608 ---- /* * Now compute the character class contents. */ switch ((enum classes) index) { case CC_PRINT: ! ccv = pg_ctype_get_cache(pg_wc_isprint); break; case CC_ALNUM: ! ccv = pg_ctype_get_cache(pg_wc_isalnum); break; case CC_ALPHA: ! 
ccv = pg_ctype_get_cache(pg_wc_isalpha); break; case CC_ASCII: + /* hard-wired meaning */ cv = getcvec(v, 0, 1); if (cv) addrange(cv, 0, 0x7f); break; case CC_BLANK: + /* hard-wired meaning */ cv = getcvec(v, 2, 0); addchr(cv, '\t'); addchr(cv, ' '); break; case CC_CNTRL: + /* hard-wired meaning */ cv = getcvec(v, 0, 2); addrange(cv, 0x0, 0x1f); addrange(cv, 0x7f, 0x9f); break; case CC_DIGIT: ! ccv = pg_ctype_get_cache(pg_wc_isdigit); break; case CC_PUNCT: ! ccv = pg_ctype_get_cache(pg_wc_ispunct); break; case CC_XDIGIT: + /* + * It's not clear how to define this in non-western locales, and + * even less clear that there's any particular use in trying. + * So just hard-wire the meaning. + */ cv = getcvec(v, 0, 3); if (cv) { *************** cclass(struct vars * v, /* context */ *** 630,679 **** } break; case CC_SPACE: ! cv = getcvec(v, UCHAR_MAX, 0); ! if (cv) ! { ! for (i = 0; i <= UCHAR_MAX; i++) ! { ! if (pg_wc_isspace((chr) i)) ! addchr(cv, (chr) i); ! } ! } break; case CC_LOWER: ! cv = getcvec(v, UCHAR_MAX, 0); ! if (cv) ! { ! for (i = 0; i <= UCHAR_MAX; i++) ! { ! if (pg_wc_islower((chr) i)) ! addchr(cv, (chr) i); ! } ! } break; case CC_UPPER: ! cv = getcvec(v, UCHAR_MAX, 0); ! if (cv) ! { ! for (i = 0; i <= UCHAR_MAX; i++) ! { ! if (pg_wc_isupper((chr) i)) ! addchr(cv, (chr) i); ! } ! } break; case CC_GRAPH: ! cv = getcvec(v, UCHAR_MAX, 0); ! if (cv) ! { ! for (i = 0; i <= UCHAR_MAX; i++) ! { ! if (pg_wc_isgraph((chr) i)) ! addchr(cv, (chr) i); ! } ! } break; } if (cv == NULL) ERR(REG_ESPACE); return cv; --- 612,643 ---- } break; case CC_SPACE: ! ccv = pg_ctype_get_cache(pg_wc_isspace); break; case CC_LOWER: ! ccv = pg_ctype_get_cache(pg_wc_islower); break; case CC_UPPER: ! ccv = pg_ctype_get_cache(pg_wc_isupper); break; case CC_GRAPH: ! ccv = pg_ctype_get_cache(pg_wc_isgraph); break; } + + /* If we got a cached cvec, copy it to a freeable one. */ + if (ccv != NULL) + { + cv = getcvec(v, ccv->nchrs, ccv->nranges); + if (cv) + { + cv->nchrs = ccv->nchrs; + memcpy(cv->chrs, ccv->chrs, cv->nchrs * sizeof(chr)); + cv->nranges = ccv->nranges; + memcpy(cv->ranges, ccv->ranges, cv->nranges * sizeof(chr) * 2); + } + } + if (cv == NULL) ERR(REG_ESPACE); return cv; diff --git a/src/backend/regex/regc_pg_locale.c b/src/backend/regex/regc_pg_locale.c index 7c010e372858065bedbc47382a84cb3650e03efa..9ea7a81b176de59ed989238b1849672108374d97 100644 *** a/src/backend/regex/regc_pg_locale.c --- b/src/backend/regex/regc_pg_locale.c *************** *** 1,7 **** /*------------------------------------------------------------------------- * * regc_pg_locale.c ! * ctype functions adapted to work on pg_wchar (a/k/a chr) * * This file is #included by regcomp.c; it's not meant to compile standalone. * --- 1,8 ---- /*------------------------------------------------------------------------- * * regc_pg_locale.c ! * ctype functions adapted to work on pg_wchar (a/k/a chr), ! * and functions to cache the results of wholesale ctype probing. * * This file is #included by regcomp.c; it's not meant to compile standalone. 
* *************** typedef enum *** 72,77 **** --- 73,79 ---- static PG_Locale_Strategy pg_regex_strategy; static pg_locale_t pg_regex_locale; + static Oid pg_regex_collation; /* * Hard-wired character properties for C locale *************** pg_set_regex_collation(Oid collation) *** 233,238 **** --- 235,241 ---- /* C/POSIX collations use this path regardless of database encoding */ pg_regex_strategy = PG_REGEX_LOCALE_C; pg_regex_locale = 0; + pg_regex_collation = C_COLLATION_OID; } else { *************** pg_set_regex_collation(Oid collation) *** 275,280 **** --- 278,285 ---- else pg_regex_strategy = PG_REGEX_LOCALE_1BYTE; } + + pg_regex_collation = collation; } } *************** pg_wc_tolower(pg_wchar c) *** 656,658 **** --- 661,873 ---- } return 0; /* can't get here, but keep compiler quiet */ } + + + /* + * These functions cache the results of probing libc's ctype behavior for + * all character codes of interest in a given encoding/collation. The + * result is provided as a "struct cvec", but notice that the representation + * is a touch different from a cvec created by regc_cvec.c: we allocate the + * chrs[] and ranges[] arrays separately from the struct so that we can + * realloc them larger at need. This is okay since the cvecs made here + * should never be freed by freecvec(). + * + * We use malloc not palloc since we mustn't lose control on out-of-memory; + * the main regex code expects us to return a failure indication instead. + */ + + typedef int (*pg_wc_probefunc) (pg_wchar c); + + typedef struct pg_ctype_cache + { + pg_wc_probefunc probefunc; /* pg_wc_isalpha or a sibling */ + Oid collation; /* collation this entry is for */ + struct cvec cv; /* cache entry contents */ + struct pg_ctype_cache *next; /* chain link */ + } pg_ctype_cache; + + static pg_ctype_cache *pg_ctype_cache_list = NULL; + + /* + * Add a chr or range to pcc->cv; return false if run out of memory + */ + static bool + store_match(pg_ctype_cache *pcc, pg_wchar chr1, int nchrs) + { + chr *newchrs; + + if (nchrs > 1) + { + if (pcc->cv.nranges >= pcc->cv.rangespace) + { + pcc->cv.rangespace *= 2; + newchrs = (chr *) realloc(pcc->cv.ranges, + pcc->cv.rangespace * sizeof(chr) * 2); + if (newchrs == NULL) + return false; + pcc->cv.ranges = newchrs; + } + pcc->cv.ranges[pcc->cv.nranges * 2] = chr1; + pcc->cv.ranges[pcc->cv.nranges * 2 + 1] = chr1 + nchrs - 1; + pcc->cv.nranges++; + } + else + { + assert(nchrs == 1); + if (pcc->cv.nchrs >= pcc->cv.chrspace) + { + pcc->cv.chrspace *= 2; + newchrs = (chr *) realloc(pcc->cv.chrs, + pcc->cv.chrspace * sizeof(chr)); + if (newchrs == NULL) + return false; + pcc->cv.chrs = newchrs; + } + pcc->cv.chrs[pcc->cv.nchrs++] = chr1; + } + return true; + } + + /* + * Given a probe function (e.g., pg_wc_isalpha) get a struct cvec for all + * chrs satisfying the probe function. The active collation is the one + * previously set by pg_set_regex_collation. Returns NULL if out of memory. + * + * Note that the result must NOT be freed or modified by caller. + */ + static const struct cvec * + pg_ctype_get_cache(pg_wc_probefunc probefunc) + { + pg_ctype_cache *pcc; + pg_wchar max_chr; + pg_wchar cur_chr; + int seq; + chr *newchrs; + + /* + * Do we already have the answer cached? + */ + for (pcc = pg_ctype_cache_list; pcc != NULL; pcc = pcc->next) + { + if (pcc->probefunc == probefunc && + pcc->collation == pg_regex_collation) + return &pcc->cv; + } + + /* + * Nope, so initialize some workspace ... 
+ */ + pcc = (pg_ctype_cache *) malloc(sizeof(pg_ctype_cache)); + if (pcc == NULL) + return NULL; + pcc->probefunc = probefunc; + pcc->collation = pg_regex_collation; + pcc->cv.nchrs = 0; + pcc->cv.chrspace = 256; + pcc->cv.chrs = (chr *) malloc(pcc->cv.chrspace * sizeof(chr)); + pcc->cv.nranges = 0; + pcc->cv.rangespace = 256; + pcc->cv.ranges = (chr *) malloc(pcc->cv.rangespace * sizeof(chr) * 2); + if (pcc->cv.chrs == NULL || pcc->cv.ranges == NULL) + goto out_of_memory; + + /* + * Decide how many character codes we ought to look through. For C locale + * there's no need to go further than 127. Otherwise, if the encoding is + * UTF8 go up to 0xFFFF (the end of the Basic Multilingual Plane). + * Otherwise, go up to 255. These limits are interrelated with + * restrictions discussed at the head of this file. + */ + switch (pg_regex_strategy) + { + case PG_REGEX_LOCALE_C: + max_chr = (pg_wchar) 127; + break; + case PG_REGEX_LOCALE_WIDE: + case PG_REGEX_LOCALE_WIDE_L: + max_chr = (pg_wchar) 0xFFFF; + break; + case PG_REGEX_LOCALE_1BYTE: + case PG_REGEX_LOCALE_1BYTE_L: + max_chr = (pg_wchar) UCHAR_MAX; + break; + default: + max_chr = 0; /* can't get here, but keep compiler quiet */ + break; + } + + /* + * And scan 'em ... + */ + seq = 0; /* number of consecutive matches */ + + for (cur_chr = 0; cur_chr <= max_chr; cur_chr++) + { + if ((*probefunc) (cur_chr)) + seq++; + else if (seq > 0) + { + if (!store_match(pcc, cur_chr - seq, seq)) + goto out_of_memory; + seq = 0; + } + } + + if (seq > 0) + if (!store_match(pcc, cur_chr - seq, seq)) + goto out_of_memory; + + /* + * We might have allocated more memory than needed, if so free it + */ + if (pcc->cv.nchrs == 0) + { + free(pcc->cv.chrs); + pcc->cv.chrs = NULL; + pcc->cv.chrspace = 0; + } + else if (pcc->cv.nchrs < pcc->cv.chrspace) + { + newchrs = (chr *) realloc(pcc->cv.chrs, + pcc->cv.nchrs * sizeof(chr)); + if (newchrs == NULL) + goto out_of_memory; + pcc->cv.chrs = newchrs; + pcc->cv.chrspace = pcc->cv.nchrs; + } + if (pcc->cv.nranges == 0) + { + free(pcc->cv.ranges); + pcc->cv.ranges = NULL; + pcc->cv.rangespace = 0; + } + else if (pcc->cv.nranges < pcc->cv.rangespace) + { + newchrs = (chr *) realloc(pcc->cv.ranges, + pcc->cv.nranges * sizeof(chr) * 2); + if (newchrs == NULL) + goto out_of_memory; + pcc->cv.ranges = newchrs; + pcc->cv.rangespace = pcc->cv.nranges; + } + + /* + * Success, link it into cache chain + */ + pcc->next = pg_ctype_cache_list; + pg_ctype_cache_list = pcc; + + return &pcc->cv; + + /* + * Failure, clean up + */ + out_of_memory: + if (pcc->cv.chrs) + free(pcc->cv.chrs); + if (pcc->cv.ranges) + free(pcc->cv.ranges); + free(pcc); + + return NULL; + }
I don't believe it is valid to ignore CJK characters above U+20000. If such characters are used in names, they will be stored in the database, and if their behaviour differs from that of characters below U+FFFF, a bug report will arrive sooner or later. See CJK Extensions B, C, and D at http://www.unicode.org/charts/ Also, there are some code points that could be regarded as letters and numbers: http://en.wikipedia.org/wiki/Mathematical_alphanumeric_symbols On the other hand, it is OK if processing of characters above U+10000 is very slow, as long as they are processed correctly, because such characters are rare. On 2012/02/17, at 23:56, Andrew Dunstan wrote: > > > On 02/17/2012 09:39 AM, Tom Lane wrote: >> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes: >>> Here's a wild idea: keep the class of each codepoint in a hash table. >>> Initialize it with all codepoints up to 0xFFFF. After that, whenever a >>> string contains a character that's not in the hash table yet, query the >>> class of that character, and add it to the hash table. Then recompile >>> the whole regex and restart the matching engine. >>> Recompiling is expensive, but if you cache the results for the session, >>> it would probably be acceptable. >> Dunno ... recompiling is so expensive that I can't see this being a win; >> not to mention that it would require fundamental surgery on the regex >> code. >> >> In the Tcl implementation, no codepoints above U+FFFF have any locale >> properties (alpha/digit/punct/etc), period. Personally I'd not have a >> problem imposing the same limitation, so that dealing with stuff above >> that range isn't really a consideration anyway. > > > up to U+FFFF is the BMP which is described as containing "characters for almost all modern languages, and a large number of special characters." It seems very likely to be acceptable not to bother about the locale of code points in the supplementary planes. > > See <http://en.wikipedia.org/wiki/Plane_%28Unicode%29> for descriptions of which sets of characters are involved. > > > cheers > > andrew
NISHIYAMA Tomoaki <tomoakin@staff.kanazawa-u.ac.jp> writes: > I don't believe it is valid to ignore CJK characters above U+20000. > If it is used for names, it will be stored in the database. > If the behaviour is different from characters below U+FFFF, you will > get a bug report in meanwhile. I am skeptical that there is enough usage of such things to justify slowing regexp operations down for everybody. Note that it's not only the initial probe of libc behavior that's at stake here --- the more character codes are treated as letters, the larger the DFA transition maps get and the more time it takes to build them. So I'm unexcited about just cranking up the loop limit in pg_ctype_get_cache. > On the other hand, it is ok if processing of characters above U+10000 > is very slow, as far as properly processed, because it is considered > rare. Yeah, it's conceivable that we could implement something whereby characters with codes above some cutoff point are handled via runtime calls to iswalpha() and friends, rather than being included in the statically-constructed DFA maps. The cutoff point could likely be a lot less than U+FFFF, too, thereby saving storage and map build time all round. However, that "we" above is the editorial "we". *I* am not going to do this. Somebody who actually has a need for it should step up. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > Yeah, it's conceivable that we could implement something whereby > characters with codes above some cutoff point are handled via runtime > calls to iswalpha() and friends, rather than being included in the > statically-constructed DFA maps. The cutoff point could likely be a lot > less than U+FFFF, too, thereby saving storage and map build time all > round. It's been proposed to build a “regexp” type in PostgreSQL which would store the DFA directly and provides some way to run that DFA out of its “storage” without recompiling. Would such a mechanism be useful here? Would it be useful only when storing the regexp in a column somewhere then applying it in the query from there (so most probably adding a join or subquery somewhere)? Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Dimitri Fontaine <dimitri@2ndQuadrant.fr> writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: >> Yeah, it's conceivable that we could implement something whereby >> characters with codes above some cutoff point are handled via runtime >> calls to iswalpha() and friends, rather than being included in the >> statically-constructed DFA maps. The cutoff point could likely be a lot >> less than U+FFFF, too, thereby saving storage and map build time all >> round. > It's been proposed to build a “regexp” type in PostgreSQL which would > store the DFA directly and provides some way to run that DFA out of its > “storage” without recompiling. > Would such a mechanism be useful here? No, this is about what goes into the DFA representation in the first place, not about how we store it and reuse it. regards, tom lane
I wrote: > And here's a poorly-tested draft patch for that. I've done some more testing now, and am satisfied that this works as intended. However, some crude performance testing suggests that people might be annoyed with it. As an example, in 9.1 with pl_PL.utf8 locale, I see this: select 'aaaaaaaaaa' ~ '\w\w\w\w\w\w\w\w\w\w\w'; taking perhaps 0.75 ms on first execution and 0.4 ms on subsequent executions, the difference being the time needed to compile and cache the DFA representation of the regexp. With the patch, the numbers are more like 5 ms and 0.4 ms, meaning the compilation time has gone up by something near a factor of 10, though AFAICT execution time hasn't moved. It's hard to tell how significant that would be to real-world queries, but in the worst case where our caching of regexps doesn't help much, it could be disastrous. All of the extra time is in manipulation of the much larger number of DFA arcs required to represent all the additional character codes that are being considered to be letters. Perhaps I'm being overly ASCII-centric, but I'm afraid to commit this as-is; I think the number of people who are hurt by the performance degradation will be greatly larger than the number who are glad because characters in $random_alphabet are now seen to be letters. I think an actually workable solution will require something like what I speculated about earlier: > Yeah, it's conceivable that we could implement something whereby > characters with codes above some cutoff point are handled via runtime > calls to iswalpha() and friends, rather than being included in the > statically-constructed DFA maps. The cutoff point could likely be a lot > less than U+FFFF, too, thereby saving storage and map build time all > round. In the meantime, I still think the caching logic is worth having, and we could at least make some people happy if we selected a cutoff point somewhere between U+FF and U+FFFF. I don't have any strong ideas about what a good compromise cutoff would be. One possibility is U+7FF, which corresponds to the limit of what fits in 2-byte UTF8; but I don't know if that corresponds to any significant dropoff in frequency of usage. regards, tom lane
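For anyone wanting to reproduce that comparison, a psql session along these lines should show the same effect (this assumes a database using the pl_PL.utf8 locale as above; absolute numbers will of course vary by machine and build):

\timing on

-- First execution pays the regexp compilation cost that the patch increases:
SELECT 'aaaaaaaaaa' ~ '\w\w\w\w\w\w\w\w\w\w\w';

-- Repeating the identical expression hits the backend's compiled-regexp cache,
-- so mostly execution time remains:
SELECT 'aaaaaaaaaa' ~ '\w\w\w\w\w\w\w\w\w\w\w';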
On Sat, Feb 18, 2012 at 7:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Yeah, it's conceivable that we could implement something whereby >> characters with codes above some cutoff point are handled via runtime >> calls to iswalpha() and friends, rather than being included in the >> statically-constructed DFA maps. The cutoff point could likely be a lot >> less than U+FFFF, too, thereby saving storage and map build time all >> round. > > In the meantime, I still think the caching logic is worth having, and > we could at least make some people happy if we selected a cutoff point > somewhere between U+FF and U+FFFF. I don't have any strong ideas about > what a good compromise cutoff would be. One possibility is U+7FF, which > corresponds to the limit of what fits in 2-byte UTF8; but I don't know > if that corresponds to any significant dropoff in frequency of usage. The problem, of course, is that this probably depends quite a bit on what language you happen to be using. For some languages, it won't matter whether you cut it off at U+FF or U+7FF; while for others even U+FFFF might not be enough. So I think this is one of those cases where it's somewhat meaningless to talk about frequency of usage. In theory you can imagine a regular expression engine where these decisions can be postponed until we see the string we're matching against. IOW, your DFA ends up with state transitions for characters specifically named, plus a state transition for "anything else that's a letter", plus a state transition for "anything else not otherwise specified". Then you only need to test the letters that actually appear in the target string, rather than all of the ones that might appear there. But implementing that could be quite a lot of work. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sun, Feb 19, 2012 at 04:33, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Feb 18, 2012 at 7:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Yeah, it's conceivable that we could implement something whereby
> >> characters with codes above some cutoff point are handled via runtime
> >> calls to iswalpha() and friends, rather than being included in the
> >> statically-constructed DFA maps. The cutoff point could likely be a lot
> >> less than U+FFFF, too, thereby saving storage and map build time all
> >> round.
> >
> > In the meantime, I still think the caching logic is worth having, and
> > we could at least make some people happy if we selected a cutoff point
> > somewhere between U+FF and U+FFFF. I don't have any strong ideas about
> > what a good compromise cutoff would be. One possibility is U+7FF, which
> > corresponds to the limit of what fits in 2-byte UTF8; but I don't know
> > if that corresponds to any significant dropoff in frequency of usage.
> The problem, of course, is that this probably depends quite a bit on
> what language you happen to be using. For some languages, it won't
> matter whether you cut it off at U+FF or U+7FF; while for others even
> U+FFFF might not be enough. So I think this is one of those cases
> where it's somewhat meaningless to talk about frequency of usage.
Does it make sense for regexps to have collations?
On Sat, Feb 18, 2012 at 10:38 PM, Vik Reykja <vikreykja@gmail.com> wrote: > Does it make sense for regexps to have collations? As I understand it, collations determine the sort-ordering of strings. Regular expressions don't care about that. Why do you ask? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sun, Feb 19, 2012 at 05:03, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Feb 18, 2012 at 10:38 PM, Vik Reykja <vikreykja@gmail.com> wrote:
> > Does it make sense for regexps to have collations?
> As I understand it, collations determine the sort-ordering of strings.
> Regular expressions don't care about that. Why do you ask?
Perhaps I used the wrong term, but I was thinking the locale could tell us what alphabet we're dealing with. So a regexp using en_US would give different word-boundary results from one using zh_CN.
Robert Haas <robertmhaas@gmail.com> writes: > In theory you can imagine a regular expression engine where these > decisions can be postponed until we see the string we're matching > against. IOW, your DFA ends up with state transitions for characters > specifically named, plus a state transition for "anything else that's > a letter", plus a state transition for "anything else not otherwise > specified". Then you only need to test the letters that actually > appear in the target string, rather than all of the ones that might > appear there. > But implementing that could be quite a lot of work. Yeah, not to mention slow. The difficulty is overlapping sets of characters. As a simple example, if your regex refers to 3, 7, [[:digit:]], X, and [[:alnum:]], then you end up needing five distinct "colors": 3, 7, X, all digits that aren't 3 or 7, all alphanumerics that aren't any of the preceding. And state transitions for the digit and alnum cases had better mention all and only the correct colors. I've been tracing through the logic this evening, and it works pretty simply given that all named character classes are immediately expanded out to their component characters. If we are going to try to keep the classes in some kind of symbolic form, it's a lot messier. In particular, I think your sketch above would lead to having to test every character against iswdigit and iswalnum at runtime, which would be disastrous performancewise. I'd like to at least avoid that for the shorter (and presumably more common) UTF8 codes. regards, tom lane
Vik Reykja <vikreykja@gmail.com> writes: > On Sun, Feb 19, 2012 at 05:03, Robert Haas <robertmhaas@gmail.com> wrote: >> On Sat, Feb 18, 2012 at 10:38 PM, Vik Reykja <vikreykja@gmail.com> wrote: >>> Does it make sense for regexps to have collations? >> As I understand it, collations determine the sort-ordering of strings. >> Regular expressions don't care about that. Why do you ask? > Perhaps I used the wrong term, but I was thinking the locale could tell us > what alphabet we're dealing with. So a regexp using en_US would give > different word-boundary results from one using zh_CN. Our interpretation of a "collation" is that it sets both LC_COLLATE and LC_CTYPE. Regexps may not care about the first but they definitely care about the second. This is why the stuff in regc_pg_locale.c pays attention to collation. regards, tom lane
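A small illustration of that LC_CTYPE dependence, assuming the "C" and "en_US.utf8" collations are both available in the database (exact results depend on the platform's locale data):

-- The collation attached to the operands supplies LC_CTYPE for the regexp.
SELECT 'é' COLLATE "C" ~ '[[:alpha:]]';            -- false: C locale counts only ASCII letters
SELECT 'é' COLLATE "en_US.utf8" ~ '[[:alpha:]]';   -- true: é is a letter in this locale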
On Sat, Feb 18, 2012 at 11:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> In theory you can imagine a regular expression engine where these >> decisions can be postponed until we see the string we're matching >> against. IOW, your DFA ends up with state transitions for characters >> specifically named, plus a state transition for "anything else that's >> a letter", plus a state transition for "anything else not otherwise >> specified". Then you only need to test the letters that actually >> appear in the target string, rather than all of the ones that might >> appear there. > >> But implementing that could be quite a lot of work. > > Yeah, not to mention slow. The difficulty is overlapping sets of > characters. As a simple example, if your regex refers to 3, 7, > [[:digit:]], X, and [[:alnum:]], then you end up needing five distinct > "colors": 3, 7, X, all digits that aren't 3 or 7, all alphanumerics > that aren't any of the preceding. And state transitions for the digit > and alnum cases had better mention all and only the correct colors. Yeah, that's unfortunate. On the other hand, if you don't use colors for this case, aren't you going to need, for each DFA state, a gigantic lookup table that includes every character in the server encoding? Even if you've got plenty of memory, initializing such a beast seems awfully expensive, and it might not do very good things for cache locality, either. > I've been tracing through the logic this evening, and it works pretty > simply given that all named character classes are immediately expanded > out to their component characters. If we are going to try to keep > the classes in some kind of symbolic form, it's a lot messier. In > particular, I think your sketch above would lead to having to test > every character against iswdigit and iswalnum at runtime, which would > be disastrous performancewise. I'd like to at least avoid that for the > shorter (and presumably more common) UTF8 codes. Hmm, but you could cache that information. Instead of building a cache that covers every possible character that might appear in the target string, you can just cache the results for the code points that you actually see. Yet another option would be to dictate that the cache can't have holes - it will always include information for every code point from 0 up to some value X. If we see a code point in the target string which is greater than X, then we extend the cache out as far as that code point. That way, people who are using only code points out to U+FF (or even U+7F) don't pay the cost of building a large cache, but people who need it can get correct behavior. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
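A sketch of that "no holes" scheme in isolation might look like the following; the names are invented, it handles a single classifier only, and error handling is omitted:

#include <stdbool.h>
#include <stdlib.h>
#include <wctype.h>

static bool *alpha_cache = NULL;    /* alpha_cache[c] is valid for c <= cache_limit */
static long cache_limit = -1;       /* highest code point probed so far */

/*
 * Classify one character, extending the cache on demand so that it always
 * covers every code point from 0 up to the largest one seen.
 */
static bool
cached_isalpha(wint_t c)
{
    if ((long) c > cache_limit)
    {
        long        i;

        alpha_cache = realloc(alpha_cache, ((size_t) c + 1) * sizeof(bool));
        for (i = cache_limit + 1; i <= (long) c; i++)
            alpha_cache[i] = (iswalpha((wint_t) i) != 0);
        cache_limit = c;
    }
    return alpha_cache[c];
}

Callers that never see anything beyond U+00FF would only ever pay for 256 probes, while a string containing, say, U+4E2D would transparently extend the cache up to that point.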
On fre, 2012-02-17 at 10:19 -0500, Tom Lane wrote: > > What if you did this ONCE and wrote the results to a file someplace? > > That's still a cache, you've just defaulted on your obligation to think > about what conditions require the cache to be flushed. (In the case at > hand, the trigger for a cache rebuild would probably need to be a glibc > package update, which we have no way of knowing about.) We basically hardwire locale behavior at initdb time, so computing this then and storing it somewhere for eternity would be consistent.
Peter Eisentraut <peter_e@gmx.net> writes: > On fre, 2012-02-17 at 10:19 -0500, Tom Lane wrote: >> That's still a cache, you've just defaulted on your obligation to think >> about what conditions require the cache to be flushed. (In the case at >> hand, the trigger for a cache rebuild would probably need to be a glibc >> package update, which we have no way of knowing about.) > We basically hardwire locale behavior at initdb time, so computing this > then and storing it somewhere for eternity would be consistent. Well, only if we could cache every locale-related libc inquiry we ever make. Locking down only part of the behavior does not sound like a plan. regards, tom lane