Thread: Re: make_greater_string() does not return a string in some cases
Hello. May I continue with this topic? This glitch is hard to ignore for those of us using CJK (Chinese, Japanese, and Korean) characters in the database. For Japanese under typical usage, perhaps a hundred characters out of seven thousand make make_greater_string() fail. That is not frequent, but not rare enough to be ignorable either.

I think this glitch occurs because the method for deriving the `next character' is fundamentally specific to each encoding, yet make_greater_string() currently derives it for all encodings alike, using a method extrapolated from the single-byte ASCII charset. So I think it is reasonable for the encoding info table (struct pg_wchar_tbl) to hold the function that does this. How about this idea? The points needed to realize it are:

- pg_wchar_tbl in pg_wchar.c gets a new element `charinc' that holds a function to increment a character of that encoding.
- By default, the value of charinc is a `generic' increment function that does what make_greater_string() does in the current implementation.
- make_greater_string() now uses the charinc of the database encoding to increment characters, instead of the code written directly into it.
- UTF-8 gets a special increment function.

As a consequence of this modification, make_greater_string() becomes somewhat simpler, since the sequence that handles bare bytes in the string disappears. Incrementing a character with knowledge of the encoding can be straightforward, cheap, and backtrack-free, and produces fewer glitches than the generic method. (The handling for BYTEAOID remains there, disappointingly.) Some glitches still remain, but I think a conversion that changes the byte length of the character would be overkill. Only 5 code points out of the 17 thousand tested (roughly all BMP characters under the current method) remain, and none of them are Japanese characters :-)

The attached patch is a sample implementation of this idea. What do you think about it?

--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 10b73fb..4151ce2 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -5502,6 +5502,16 @@ pattern_selectivity(Const *patt, Pattern_Type ptype)
 
 /*
+ * This function is the "character increment" function for bytea, used in
+ * make_greater_string(); it has the same interface as pg_wchar_tbl.charinc.
+ */
+static bool byte_increment(unsigned char *ptr, int len)
+{
+	(*ptr)--;
+	return true;
+}
+
+/*
  * Try to generate a string greater than the given string or any
  * string it is a prefix of.  If successful, return a palloc'd string
  * in the form of a Const node; else return NULL.
@@ -5540,6 +5550,7 @@ make_greater_string(const Const *str_const, FmgrInfo *ltproc, Oid collation)
 	int			len;
 	Datum		cmpstr;
 	text	   *cmptxt = NULL;
+	character_incrementer charincfunc;
 
 	/*
 	 * Get a modifiable copy of the prefix string in C-string format, and set
@@ -5601,27 +5612,38 @@ make_greater_string(const Const *str_const, FmgrInfo *ltproc, Oid collation)
 		}
 	}
 
+	if (datatype != BYTEAOID)
+		charincfunc = pg_database_encoding_character_incrementer();
+	else
+		charincfunc = &byte_increment;
+
 	while (len > 0)
 	{
-		unsigned char *lastchar = (unsigned char *) (workstr + len - 1);
-		unsigned char savelastchar = *lastchar;
+		int			charlen;
+		unsigned char *lastchar;
+		unsigned char savelastbyte;
+		Const	   *workstr_const;
+
+		if (datatype == BYTEAOID)
+			charlen = 1;
+		else
+			charlen = len - pg_mbcliplen(workstr, len, len - 1);
+
+		lastchar = (unsigned char *) (workstr + len - charlen);
 
 		/*
-		 * Try to generate a larger string by incrementing the last byte.
+		 * savelastbyte has meaning only for datatype == BYTEAOID
 		 */
-		while (*lastchar < (unsigned char) 255)
-		{
-			Const	   *workstr_const;
+		savelastbyte = *lastchar;
 
-			(*lastchar)++;
+		/*
+		 * Try to generate a larger string by incrementing the last byte or
+		 * character.
+		 */
+		if (charincfunc(lastchar, charlen))
+		{
 			if (datatype != BYTEAOID)
-			{
-				/* do not generate invalid encoding sequences */
-				if (!pg_verifymbstr(workstr, len, true))
-					continue;
 				workstr_const = string_to_const(workstr, datatype);
-			}
 			else
 				workstr_const = string_to_bytea_const(workstr, len);
 
@@ -5636,26 +5658,17 @@ make_greater_string(const Const *str_const, FmgrInfo *ltproc, Oid collation)
 				pfree(workstr);
 				return workstr_const;
 			}
-
+
 			/* No good, release unusable value and try again */
 			pfree(DatumGetPointer(workstr_const->constvalue));
 			pfree(workstr_const);
 		}
 
-		/* restore last byte so we don't confuse pg_mbcliplen */
-		*lastchar = savelastchar;
-
 		/*
-		 * Truncate off the last character, which might be more than 1 byte,
-		 * depending on the character encoding.
+		 * Truncate off the last character, or restore the last byte for BYTEA.
 		 */
-		if (datatype != BYTEAOID && pg_database_encoding_max_length() > 1)
-			len = pg_mbcliplen(workstr, len, len - 1);
-		else
-			len -= 1;
-
-		if (datatype != BYTEAOID)
-			workstr[len] = '\0';
+		len -= charlen;
+		workstr[len] = (datatype != BYTEAOID ? '\0' : savelastbyte);
 	}
 
 	/* Failed... */

diff --git a/src/backend/utils/mb/wchar.c b/src/backend/utils/mb/wchar.c
index 5b0cf62..1d6aee0 100644
--- a/src/backend/utils/mb/wchar.c
+++ b/src/backend/utils/mb/wchar.c
@@ -935,6 +935,85 @@ pg_gb18030_dsplen(const unsigned char *s)
 
 /*
  *-------------------------------------------------------------------
+ * multibyte character incrementer
+ *
+ * These functions accept "charptr", a pointer to the first byte of a
+ * possibly-multibyte character.  They try to `increment' the character
+ * and return true on success.  If they return false, the character is
+ * not modified.
+ *-------------------------------------------------------------------
+ */
+
+static bool pg_generic_charinc(unsigned char *charptr, int len)
+{
+	unsigned char *lastchar = (unsigned char *) (charptr + len - 1);
+	unsigned char savelastchar = *lastchar;
+	const char *const_charptr = (const char *) charptr;
+
+	while (*lastchar < (unsigned char) 255)
+	{
+		(*lastchar)++;
+		if (!pg_verifymbstr(const_charptr, len, true))
+			continue;
+		return true;
+	}
+
+	*lastchar = savelastchar;
+	return false;
+}
+
+static bool pg_utf8_increment(unsigned char *charptr, int length)
+{
+	unsigned char a;
+	unsigned char bak[4];
+
+	memcpy(bak, charptr, length);
+	switch (length)
+	{
+		default:
+			/* reject lengths 5 and 6 for now */
+			return false;
+		case 4:
+			a = charptr[3];
+			if (a < 0xBF)
+			{
+				charptr[3]++;
+				break;
+			}
+			charptr[3] = 0x80;
+			/* FALL THRU */
+		case 3:
+			a = charptr[2];
+			if (a < 0xBF)
+			{
+				charptr[2]++;
+				break;
+			}
+			charptr[2] = 0x80;
+			/* FALL THRU */
+		case 2:
+			a = charptr[1];
+			if ((*charptr == 0xed && a < 0x9F) || a < 0xBF)
+			{
+				charptr[1]++;
+				break;
+			}
+			charptr[1] = 0x80;
+			/* FALL THRU */
+		case 1:
+			a = *charptr;
+			if (a == 0x7F || a == 0xDF || a == 0xEF || a == 0xF7)
+			{
+				memcpy(charptr, bak, length);
+				return false;
+			}
+			charptr[0]++;
+			break;
+	}
+
+	return pg_utf8_islegal(charptr, length);
+}
+
+/*
+ *-------------------------------------------------------------------
  * multibyte sequence validators
  *
  * These functions accept "s", a pointer to the first byte of a string,
@@ -1341,48 +1420,48 @@ pg_utf8_islegal(const unsigned char *source, int length)
  *-------------------------------------------------------------------
  */
 
 pg_wchar_tbl pg_wchar_table[] = {
-	{pg_ascii2wchar_with_len, pg_ascii_mblen, pg_ascii_dsplen, pg_ascii_verifier, 1},	/* PG_SQL_ASCII */
-	{pg_eucjp2wchar_with_len, pg_eucjp_mblen, pg_eucjp_dsplen, pg_eucjp_verifier, 3},	/* PG_EUC_JP */
-	{pg_euccn2wchar_with_len, pg_euccn_mblen, pg_euccn_dsplen, pg_euccn_verifier, 2},	/* PG_EUC_CN */
-	{pg_euckr2wchar_with_len, pg_euckr_mblen, pg_euckr_dsplen, pg_euckr_verifier, 3},	/* PG_EUC_KR */
-	{pg_euctw2wchar_with_len, pg_euctw_mblen, pg_euctw_dsplen, pg_euctw_verifier, 4},	/* PG_EUC_TW */
-	{pg_eucjp2wchar_with_len, pg_eucjp_mblen, pg_eucjp_dsplen, pg_eucjp_verifier, 3},	/* PG_EUC_JIS_2004 */
-	{pg_utf2wchar_with_len, pg_utf_mblen, pg_utf_dsplen, pg_utf8_verifier, 4},	/* PG_UTF8 */
-	{pg_mule2wchar_with_len, pg_mule_mblen, pg_mule_dsplen, pg_mule_verifier, 4},	/* PG_MULE_INTERNAL */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_LATIN1 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_LATIN2 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_LATIN3 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_LATIN4 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_LATIN5 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_LATIN6 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_LATIN7 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_LATIN8 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_LATIN9 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_LATIN10 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_WIN1256 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_WIN1258 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_WIN866 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_WIN874 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_KOI8R */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_WIN1251 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_WIN1252 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* ISO-8859-5 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* ISO-8859-6 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* ISO-8859-7 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* ISO-8859-8 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_WIN1250 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_WIN1253 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_WIN1254 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_WIN1255 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_WIN1257 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_KOI8U */
-	{0, pg_sjis_mblen, pg_sjis_dsplen, pg_sjis_verifier, 2},	/* PG_SJIS */
-	{0, pg_big5_mblen, pg_big5_dsplen, pg_big5_verifier, 2},	/* PG_BIG5 */
-	{0, pg_gbk_mblen, pg_gbk_dsplen, pg_gbk_verifier, 2},	/* PG_GBK */
-	{0, pg_uhc_mblen, pg_uhc_dsplen, pg_uhc_verifier, 2},	/* PG_UHC */
-	{0, pg_gb18030_mblen, pg_gb18030_dsplen, pg_gb18030_verifier, 4},	/* PG_GB18030 */
-	{0, pg_johab_mblen, pg_johab_dsplen, pg_johab_verifier, 3},	/* PG_JOHAB */
-	{0, pg_sjis_mblen, pg_sjis_dsplen, pg_sjis_verifier, 2}	/* PG_SHIFT_JIS_2004 */
+	{pg_ascii2wchar_with_len, pg_ascii_mblen, pg_ascii_dsplen, pg_generic_charinc, pg_ascii_verifier, 1},	/* PG_SQL_ASCII */
+	{pg_eucjp2wchar_with_len, pg_eucjp_mblen, pg_eucjp_dsplen, pg_generic_charinc, pg_eucjp_verifier, 3},	/* PG_EUC_JP */
+	{pg_euccn2wchar_with_len, pg_euccn_mblen, pg_euccn_dsplen, pg_generic_charinc, pg_euccn_verifier, 2},	/* PG_EUC_CN */
+	{pg_euckr2wchar_with_len, pg_euckr_mblen, pg_euckr_dsplen, pg_generic_charinc, pg_euckr_verifier, 3},	/* PG_EUC_KR */
+	{pg_euctw2wchar_with_len, pg_euctw_mblen, pg_euctw_dsplen, pg_generic_charinc, pg_euctw_verifier, 4},	/* PG_EUC_TW */
+	{pg_eucjp2wchar_with_len, pg_eucjp_mblen, pg_eucjp_dsplen, pg_generic_charinc, pg_eucjp_verifier, 3},	/* PG_EUC_JIS_2004 */
+	{pg_utf2wchar_with_len, pg_utf_mblen, pg_utf_dsplen, pg_utf8_increment, pg_utf8_verifier, 4},	/* PG_UTF8 */
+	{pg_mule2wchar_with_len, pg_mule_mblen, pg_mule_dsplen, pg_generic_charinc, pg_mule_verifier, 4},	/* PG_MULE_INTERNAL */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_LATIN1 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_LATIN2 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_LATIN3 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_LATIN4 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_LATIN5 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_LATIN6 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_LATIN7 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_LATIN8 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_LATIN9 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_LATIN10 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_WIN1256 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_WIN1258 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_WIN866 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_WIN874 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_KOI8R */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_WIN1251 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_WIN1252 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* ISO-8859-5 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* ISO-8859-6 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* ISO-8859-7 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* ISO-8859-8 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_WIN1250 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_WIN1253 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_WIN1254 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_WIN1255 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_WIN1257 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_KOI8U */
+	{0, pg_sjis_mblen, pg_sjis_dsplen, pg_generic_charinc, pg_sjis_verifier, 2},	/* PG_SJIS */
+	{0, pg_big5_mblen, pg_big5_dsplen, pg_generic_charinc, pg_big5_verifier, 2},	/* PG_BIG5 */
+	{0, pg_gbk_mblen, pg_gbk_dsplen, pg_generic_charinc, pg_gbk_verifier, 2},	/* PG_GBK */
+	{0, pg_uhc_mblen, pg_uhc_dsplen, pg_generic_charinc, pg_uhc_verifier, 2},	/* PG_UHC */
+	{0, pg_gb18030_mblen, pg_gb18030_dsplen, pg_generic_charinc, pg_gb18030_verifier, 4},	/* PG_GB18030 */
+	{0, pg_johab_mblen, pg_johab_dsplen, pg_generic_charinc, pg_johab_verifier, 3},	/* PG_JOHAB */
+	{0, pg_sjis_mblen, pg_sjis_dsplen, pg_generic_charinc, pg_sjis_verifier, 2}	/* PG_SHIFT_JIS_2004 */
 };
 
 /* returns the byte length of a word for mule internal code */
@@ -1459,6 +1538,15 @@ pg_database_encoding_max_length(void)
 }
 
 /*
+ * fetch the character incrementer for the encoding of the current database
+ */
+character_incrementer
+pg_database_encoding_character_incrementer(void)
+{
+	return pg_wchar_table[GetDatabaseEncoding()].charinc;
+}
+
+/*
  * Verify mbstr to make sure that it is validly encoded in the current
  * database encoding.  Otherwise same as pg_verify_mbstr().
  */

diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index 826c7af..356703a 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -284,6 +284,8 @@ typedef int (*mblen_converter) (const unsigned char *mbstr);
 
 typedef int (*mbdisplaylen_converter) (const unsigned char *mbstr);
 
+typedef bool (*character_incrementer) (unsigned char *mbstr, int len);
+
 typedef int (*mbverifier) (const unsigned char *mbstr, int len);
 
 typedef struct
@@ -292,6 +294,7 @@ typedef struct
 										 * string to a wchar */
 	mblen_converter mblen;		/* get byte length of a char */
 	mbdisplaylen_converter dsplen;	/* get display width of a char */
+	character_incrementer charinc;	/* character code incrementer, if not null */
 	mbverifier	mbverify;		/* verify multibyte sequence */
 	int			maxmblen;		/* max bytes for a char in this encoding */
 } pg_wchar_tbl;
@@ -389,6 +392,7 @@ extern int	pg_encoding_mbcliplen(int encoding, const char *mbstr,
 extern int	pg_mbcharcliplen(const char *mbstr, int len, int limit);
 extern int	pg_encoding_max_length(int encoding);
 extern int	pg_database_encoding_max_length(void);
+extern character_incrementer pg_database_encoding_character_incrementer(void);
 extern int	PrepareClientEncoding(int encoding);
 extern int	SetClientEncoding(int encoding);
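For illustration only (not part of the patch), here is a hypothetical standalone program exercising the byte arithmetic of the UTF-8 incrementer above. utf8_increment is a copy of pg_utf8_increment minus the final pg_utf8_islegal() check, which would pass for the inputs shown; the names utf8_increment and show are invented for this sketch.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Copied from the patch above, without the trailing legality check. */
static bool utf8_increment(unsigned char *charptr, int length)
{
    unsigned char bak[4];

    memcpy(bak, charptr, length);
    switch (length)
    {
        default:
            return false;               /* lengths 5 and 6 rejected */
        case 4:
            if (charptr[3] < 0xBF) { charptr[3]++; break; }
            charptr[3] = 0x80;
            /* FALL THRU */
        case 3:
            if (charptr[2] < 0xBF) { charptr[2]++; break; }
            charptr[2] = 0x80;
            /* FALL THRU */
        case 2:
            if ((*charptr == 0xED && charptr[1] < 0x9F) || charptr[1] < 0xBF)
            { charptr[1]++; break; }
            charptr[1] = 0x80;
            /* FALL THRU */
        case 1:
            if (*charptr == 0x7F || *charptr == 0xDF ||
                *charptr == 0xEF || *charptr == 0xF7)
            { memcpy(charptr, bak, length); return false; }
            charptr[0]++;
            break;
    }
    return true;
}

static void show(const char *label, unsigned char *buf, int len)
{
    int i;

    printf("%-26s", label);
    for (i = 0; i < len; i++) printf(" %02X", buf[i]);
    if (utf8_increment(buf, len))
    { printf("  ->"); for (i = 0; i < len; i++) printf(" %02X", buf[i]); }
    else
        printf("  -> (failed)");
    printf("\n");
}

int main(void)
{
    unsigned char a[] = {0x41};             /* U+0041 'A' -> U+0042 'B' */
    unsigned char b[] = {0xC3, 0xBF};       /* U+00FF -> U+0100: C4 80 */
    unsigned char c[] = {0xED, 0x9E, 0xBF}; /* U+D7BF -> U+D7C0: ED 9F 80 */

    show("ASCII 'A'", a, 1);
    show("U+00FF (2-byte wrap)", b, 2);
    show("U+D7BF (near surrogates)", c, 3);
    return 0;
}

The last case steps U+D7BF to U+D7C0 by wrapping the third byte and carrying into the second, staying just below the excluded surrogate range D800-DFFF.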
On Fri, Jul 8, 2011 at 5:21 AM, Kyotaro HORIGUCHI <horiguchi.kyotaro@oss.ntt.co.jp> wrote:
> The points needed to realize it are:

Please add your patch to the next CommitFest.

https://commitfest.postgresql.org/action/commitfest_view/open

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [BUGS] make_greater_string() does not return a string in some cases
From: Kyotaro HORIGUCHI
Thanks for your suggestion. I'll do so.

At Fri, 8 Jul 2011 23:28:32 -0400, Robert Haas <robertmhaas@gmail.com> wrote:
> Please add your patch to the next CommitFest.
>
> https://commitfest.postgresql.org/action/commitfest_view/open

--
Kyotaro Horiguchi
NTT Open Source Software Center
This is an update of the patch for the next CommitFest (2011/09); please ignore the previous version.

1. Additional feature: EUC-JP incrementer.
2. Bug fixes: bytea incrementer, libpq compilation.

--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 10b73fb..48a58a0 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -5502,6 +5502,18 @@ pattern_selectivity(Const *patt, Pattern_Type ptype)
 
 /*
+ * This function is the "character increment" function for bytea, used in
+ * make_greater_string(); it has the same interface as pg_wchar_tbl.charinc.
+ */
+static bool byte_increment(unsigned char *ptr, int len)
+{
+	if (*ptr >= 255) return false;
+
+	(*ptr)++;
+	return true;
+}
+
+/*
  * Try to generate a string greater than the given string or any
  * string it is a prefix of.  If successful, return a palloc'd string
  * in the form of a Const node; else return NULL.
@@ -5540,6 +5552,7 @@ make_greater_string(const Const *str_const, FmgrInfo *ltproc, Oid collation)
 	int			len;
 	Datum		cmpstr;
 	text	   *cmptxt = NULL;
+	character_incrementer charincfunc;
 
 	/*
 	 * Get a modifiable copy of the prefix string in C-string format, and set
@@ -5601,27 +5614,38 @@ make_greater_string(const Const *str_const, FmgrInfo *ltproc, Oid collation)
 		}
 	}
 
+	if (datatype != BYTEAOID)
+		charincfunc = pg_database_encoding_character_incrementer();
+	else
+		charincfunc = &byte_increment;
+
 	while (len > 0)
 	{
-		unsigned char *lastchar = (unsigned char *) (workstr + len - 1);
-		unsigned char savelastchar = *lastchar;
+		int			charlen;
+		unsigned char *lastchar;
+		unsigned char savelastbyte;
+		Const	   *workstr_const;
+
+		if (datatype == BYTEAOID)
+			charlen = 1;
+		else
+			charlen = len - pg_mbcliplen(workstr, len, len - 1);
+
+		lastchar = (unsigned char *) (workstr + len - charlen);
 
 		/*
-		 * Try to generate a larger string by incrementing the last byte.
+		 * savelastbyte has meaning only for datatype == BYTEAOID
 		 */
-		while (*lastchar < (unsigned char) 255)
-		{
-			Const	   *workstr_const;
+		savelastbyte = *lastchar;
 
-			(*lastchar)++;
+		/*
+		 * Try to generate a larger string by incrementing the last byte or
+		 * character.
+		 */
+		if (charincfunc(lastchar, charlen))
+		{
 			if (datatype != BYTEAOID)
-			{
-				/* do not generate invalid encoding sequences */
-				if (!pg_verifymbstr(workstr, len, true))
-					continue;
 				workstr_const = string_to_const(workstr, datatype);
-			}
 			else
 				workstr_const = string_to_bytea_const(workstr, len);
 
@@ -5636,26 +5660,17 @@ make_greater_string(const Const *str_const, FmgrInfo *ltproc, Oid collation)
 				pfree(workstr);
 				return workstr_const;
 			}
-
+
 			/* No good, release unusable value and try again */
 			pfree(DatumGetPointer(workstr_const->constvalue));
 			pfree(workstr_const);
 		}
 
-		/* restore last byte so we don't confuse pg_mbcliplen */
-		*lastchar = savelastchar;
-
 		/*
-		 * Truncate off the last character, which might be more than 1 byte,
-		 * depending on the character encoding.
+		 * Truncate off the last character, or restore the last byte for BYTEA.
 		 */
-		if (datatype != BYTEAOID && pg_database_encoding_max_length() > 1)
-			len = pg_mbcliplen(workstr, len, len - 1);
-		else
-			len -= 1;
-
-		if (datatype != BYTEAOID)
-			workstr[len] = '\0';
+		len -= charlen;
+		workstr[len] = (datatype != BYTEAOID ? '\0' : savelastbyte);
 	}
 
 	/* Failed... */
diff --git a/src/backend/utils/mb/wchar.c b/src/backend/utils/mb/wchar.c
index 5b0cf62..8505bcb 100644
--- a/src/backend/utils/mb/wchar.c
+++ b/src/backend/utils/mb/wchar.c
@@ -1336,53 +1336,254 @@ pg_utf8_islegal(const unsigned char *source, int length)
 
 /*
  *-------------------------------------------------------------------
+ * character incrementer
+ *
+ * These functions accept "charptr", a pointer to the first byte of a
+ * possibly-multibyte character.  They try to `increment' the character
+ * and return true on success; on failure the character is left
+ * untouched.  These functions must be implemented in correspondence
+ * with the verifiers; in other words, a character rewritten by one of
+ * these functions must pass the corresponding pg_*_verifier() check
+ * whenever true is returned.  Returning the result of the corresponding
+ * pg_*_verifier() as the last resort avoids such an inconsistency when
+ * something goes wrong.
+ * -------------------------------------------------------------------
+ */
+
+#ifndef FRONTEND
+static bool pg_generic_charinc(unsigned char *charptr, int len)
+{
+	unsigned char *lastchar = (unsigned char *) (charptr + len - 1);
+	unsigned char savelastchar = *lastchar;
+	const char *const_charptr = (const char *) charptr;
+
+	while (*lastchar < (unsigned char) 255)
+	{
+		(*lastchar)++;
+		if (!pg_verifymbstr(const_charptr, len, true))
+			continue;
+		return true;
+	}
+
+	*lastchar = savelastchar;
+	return false;
+}
+
+static bool pg_utf8_increment(unsigned char *charptr, int length)
+{
+	unsigned char a;
+	unsigned char bak[4];
+	bool		success;
+
+	memcpy(bak, charptr, length);
+	switch (length)
+	{
+		default:
+			/* reject lengths 5 and 6 for now */
+			return false;
+		case 4:
+			a = charptr[3];
+			if (a < 0xBF)
+			{
+				charptr[3]++;
+				break;
+			}
+			charptr[3] = 0x80;
+			/* FALL THRU */
+		case 3:
+			a = charptr[2];
+			if (a < 0xBF)
+			{
+				charptr[2]++;
+				break;
+			}
+			charptr[2] = 0x80;
+			/* FALL THRU */
+		case 2:
+			a = charptr[1];
+			if ((*charptr == 0xed && a < 0x9F) || a < 0xBF)
+			{
+				charptr[1]++;
+				break;
+			}
+			charptr[1] = 0x80;
+			/* FALL THRU */
+		case 1:
+			a = *charptr;
+			if (a == 0x7F || a == 0xDF || a == 0xEF || a == 0xF7)
+			{
+				memcpy(charptr, bak, length);
+				return false;
+			}
+			charptr[0]++;
+			break;
+	}
+
+	/* Check the result with pg_utf8_islegal as the last resort. */
+	success = pg_utf8_islegal(charptr, length);
+	if (!success)
+		memcpy(charptr, bak, length);
+
+	return success;
+}
+
+static bool pg_eucjp_increment(unsigned char *charptr, int length)
+{
+	unsigned char bak[3];
+	bool		success;
+	unsigned char c1, c2;
+	signed int	i;
+
+	memcpy(bak, charptr, length);
+
+	c1 = *charptr;
+
+	switch (c1)
+	{
+		case SS2:	/* JIS X 0201 */
+			if (length != 2) return false;
+
+			c2 = charptr[1];
+
+			if (c2 > 0xde)
+				charptr[0] = charptr[1] = 0xa1;
+			else if (c2 < 0xa1)
+				charptr[1] = 0xa1;
+			else
+				charptr[1]++;
+
+			break;
+
+		case SS3:	/* JIS X 0212 */
+			if (length != 3) return false;
+
+			for (i = 2; i > 1; i--)
+			{
+				c2 = charptr[i];
+				if (c2 < 0xa1)
+				{
+					charptr[i] = 0xa1;
+					return true;
+				}
+				else if (c2 < 0xfe)
+				{
+					charptr[i]++;
+					break;
+				}
+				charptr[i] = 0xa1;
+			}
+
+			if (i == 0)		/* out of code region */
+			{
+				memcpy(charptr, bak, length);
+				return false;
+			}
+
+			break;
+
+		default:
+			if (IS_HIGHBIT_SET(c1))		/* JIS X 0208? */
+			{
+				if (length != 2) return false;
+
+				for (i = 1; i >= 0; i--)	/* i must be signed */
+				{
+					c2 = charptr[i];
+					if (c2 < 0xa1)
+					{
+						charptr[i] = 0xa1;
+						return true;
+					}
+					else if (c2 < 0xfe)
+					{
+						charptr[i]++;
+						break;
+					}
+					charptr[i] = 0xa1;
+				}
+
+				if (i < 0)		/* out of 2-byte code region */
+				{
+					memcpy(charptr, bak, length);
+					return false;
+				}
+			}
+			else
+			{	/* ASCII */
+				if (c1 > 0x7e)
+					return false;
+				(*charptr)++;
+			}
+	}
+
+	/* Check the result with pg_eucjp_verifier as the last resort. */
+	success = (pg_eucjp_verifier(charptr, length) == length);
+	if (!success)
+		memcpy(charptr, bak, length);
+
+	return success;
+}
+#else
+/*
+ * Character increment functions are not available in frontend builds.
+ * Abort on call to prevent misuse.
+ */
+static bool pg_generic_charinc(unsigned char *charptr, int len)
+{
+	fputs(_("Character incrementer cannot be used in frontend.\n"), stderr);
+	abort();
+}
+#define pg_utf8_increment pg_generic_charinc
+#define pg_eucjp_increment pg_generic_charinc
+#endif
+
+/*
+ *-------------------------------------------------------------------
  * encoding info table
  * XXX must be sorted by the same order as enum pg_enc (in mb/pg_wchar.h)
  *-------------------------------------------------------------------
  */
 
 pg_wchar_tbl pg_wchar_table[] = {
-	{pg_ascii2wchar_with_len, pg_ascii_mblen, pg_ascii_dsplen, pg_ascii_verifier, 1},	/* PG_SQL_ASCII */
-	{pg_eucjp2wchar_with_len, pg_eucjp_mblen, pg_eucjp_dsplen, pg_eucjp_verifier, 3},	/* PG_EUC_JP */
-	{pg_euccn2wchar_with_len, pg_euccn_mblen, pg_euccn_dsplen, pg_euccn_verifier, 2},	/* PG_EUC_CN */
-	{pg_euckr2wchar_with_len, pg_euckr_mblen, pg_euckr_dsplen, pg_euckr_verifier, 3},	/* PG_EUC_KR */
-	{pg_euctw2wchar_with_len, pg_euctw_mblen, pg_euctw_dsplen, pg_euctw_verifier, 4},	/* PG_EUC_TW */
-	{pg_eucjp2wchar_with_len, pg_eucjp_mblen, pg_eucjp_dsplen, pg_eucjp_verifier, 3},	/* PG_EUC_JIS_2004 */
-	{pg_utf2wchar_with_len, pg_utf_mblen, pg_utf_dsplen, pg_utf8_verifier, 4},	/* PG_UTF8 */
-	{pg_mule2wchar_with_len, pg_mule_mblen, pg_mule_dsplen, pg_mule_verifier, 4},	/* PG_MULE_INTERNAL */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_LATIN1 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_LATIN2 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_LATIN3 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_LATIN4 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_LATIN5 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_LATIN6 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_LATIN7 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_LATIN8 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_LATIN9 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_LATIN10 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_WIN1256 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_WIN1258 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_WIN866 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_WIN874 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_KOI8R */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_WIN1251 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_WIN1252 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* ISO-8859-5 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* ISO-8859-6 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* ISO-8859-7 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* ISO-8859-8 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_WIN1250 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_WIN1253 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_WIN1254 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_WIN1255 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_WIN1257 */
-	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},	/* PG_KOI8U */
-	{0, pg_sjis_mblen, pg_sjis_dsplen, pg_sjis_verifier, 2},	/* PG_SJIS */
-	{0, pg_big5_mblen, pg_big5_dsplen, pg_big5_verifier, 2},	/* PG_BIG5 */
-	{0, pg_gbk_mblen, pg_gbk_dsplen, pg_gbk_verifier, 2},	/* PG_GBK */
-	{0, pg_uhc_mblen, pg_uhc_dsplen, pg_uhc_verifier, 2},	/* PG_UHC */
-	{0, pg_gb18030_mblen, pg_gb18030_dsplen, pg_gb18030_verifier, 4},	/* PG_GB18030 */
-	{0, pg_johab_mblen, pg_johab_dsplen, pg_johab_verifier, 3},	/* PG_JOHAB */
-	{0, pg_sjis_mblen, pg_sjis_dsplen, pg_sjis_verifier, 2}	/* PG_SHIFT_JIS_2004 */
+	{pg_ascii2wchar_with_len, pg_ascii_mblen, pg_ascii_dsplen, pg_generic_charinc, pg_ascii_verifier, 1},	/* PG_SQL_ASCII */
+	{pg_eucjp2wchar_with_len, pg_eucjp_mblen, pg_eucjp_dsplen, pg_eucjp_increment, pg_eucjp_verifier, 3},	/* PG_EUC_JP */
+	{pg_euccn2wchar_with_len, pg_euccn_mblen, pg_euccn_dsplen, pg_generic_charinc, pg_euccn_verifier, 2},	/* PG_EUC_CN */
+	{pg_euckr2wchar_with_len, pg_euckr_mblen, pg_euckr_dsplen, pg_generic_charinc, pg_euckr_verifier, 3},	/* PG_EUC_KR */
+	{pg_euctw2wchar_with_len, pg_euctw_mblen, pg_euctw_dsplen, pg_generic_charinc, pg_euctw_verifier, 4},	/* PG_EUC_TW */
+	{pg_eucjp2wchar_with_len, pg_eucjp_mblen, pg_eucjp_dsplen, pg_eucjp_increment, pg_eucjp_verifier, 3},	/* PG_EUC_JIS_2004 */
+	{pg_utf2wchar_with_len, pg_utf_mblen, pg_utf_dsplen, pg_utf8_increment, pg_utf8_verifier, 4},	/* PG_UTF8 */
+	{pg_mule2wchar_with_len, pg_mule_mblen, pg_mule_dsplen, pg_generic_charinc, pg_mule_verifier, 4},	/* PG_MULE_INTERNAL */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_LATIN1 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_LATIN2 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_LATIN3 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_LATIN4 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_LATIN5 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_LATIN6 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_LATIN7 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_LATIN8 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_LATIN9 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_LATIN10 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_WIN1256 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_WIN1258 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_WIN866 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_WIN874 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_KOI8R */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_WIN1251 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_WIN1252 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* ISO-8859-5 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* ISO-8859-6 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* ISO-8859-7 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* ISO-8859-8 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_WIN1250 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_WIN1253 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_WIN1254 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_WIN1255 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_WIN1257 */
+	{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1},	/* PG_KOI8U */
+	{0, pg_sjis_mblen, pg_sjis_dsplen, pg_generic_charinc, pg_sjis_verifier, 2},	/* PG_SJIS */
+	{0, pg_big5_mblen, pg_big5_dsplen, pg_generic_charinc, pg_big5_verifier, 2},	/* PG_BIG5 */
+	{0, pg_gbk_mblen, pg_gbk_dsplen, pg_generic_charinc, pg_gbk_verifier, 2},	/* PG_GBK */
+	{0, pg_uhc_mblen, pg_uhc_dsplen, pg_generic_charinc, pg_uhc_verifier, 2},	/* PG_UHC */
+	{0, pg_gb18030_mblen, pg_gb18030_dsplen, pg_generic_charinc, pg_gb18030_verifier, 4},	/* PG_GB18030 */
+	{0, pg_johab_mblen, pg_johab_dsplen, pg_generic_charinc, pg_johab_verifier, 3},	/* PG_JOHAB */
+	{0, pg_sjis_mblen, pg_sjis_dsplen, pg_generic_charinc, pg_sjis_verifier, 2}	/* PG_SHIFT_JIS_2004 */
 };
 
 /* returns the byte length of a word for mule internal code */
@@ -1416,7 +1617,7 @@ pg_encoding_dsplen(int encoding, const char *mbstr)
 	return ((encoding >= 0 &&
 			 encoding < sizeof(pg_wchar_table) / sizeof(pg_wchar_tbl)) ?
-			((*pg_wchar_table[encoding].dsplen) ((const unsigned char *) mbstr)) :
+			((*pg_wchar_table[encoding].dsplen) ((const unsigned char *) mbstr)) :
 			((*pg_wchar_table[PG_SQL_ASCII].dsplen) ((const unsigned char *) mbstr)));
 }
@@ -1459,6 +1660,15 @@ pg_database_encoding_max_length(void)
 }
 
 /*
+ * return the character incrementer for the current database encoding
+ */
+character_incrementer
+pg_database_encoding_character_incrementer(void)
+{
+	return pg_wchar_table[GetDatabaseEncoding()].charinc;
+}
+
+/*
  * Verify mbstr to make sure that it is validly encoded in the current
  * database encoding.  Otherwise same as pg_verify_mbstr().
  */

diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index 826c7af..356703a 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -284,6 +284,8 @@ typedef int (*mblen_converter) (const unsigned char *mbstr);
 
 typedef int (*mbdisplaylen_converter) (const unsigned char *mbstr);
 
+typedef bool (*character_incrementer) (unsigned char *mbstr, int len);
+
 typedef int (*mbverifier) (const unsigned char *mbstr, int len);
 
 typedef struct
@@ -292,6 +294,7 @@ typedef struct
 										 * string to a wchar */
 	mblen_converter mblen;		/* get byte length of a char */
 	mbdisplaylen_converter dsplen;	/* get display width of a char */
+	character_incrementer charinc;	/* character code incrementer, if not null */
 	mbverifier	mbverify;		/* verify multibyte sequence */
 	int			maxmblen;		/* max bytes for a char in this encoding */
 } pg_wchar_tbl;
@@ -389,6 +392,7 @@ extern int	pg_encoding_mbcliplen(int encoding, const char *mbstr,
 extern int	pg_mbcharcliplen(const char *mbstr, int len, int limit);
 extern int	pg_encoding_max_length(int encoding);
 extern int	pg_database_encoding_max_length(void);
+extern character_incrementer pg_database_encoding_character_incrementer(void);
 extern int	PrepareClientEncoding(int encoding);
 extern int	SetClientEncoding(int encoding);
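The control flow of the patched make_greater_string() may be easier to see in isolation. Below is a simplified, hypothetical model of the new loop; make_greater and ascii_charinc are invented stand-ins for this sketch, not PostgreSQL functions. The idea is: increment the last character via the encoding's charinc, and if the incrementer gives up, truncate that character and try the next position.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

typedef bool (*character_incrementer) (unsigned char *charptr, int len);

/* Toy single-byte incrementer standing in for a pg_wchar_tbl.charinc. */
static bool ascii_charinc(unsigned char *charptr, int len)
{
    if (*charptr >= 0x7E)       /* treat '~' as the largest character */
        return false;
    (*charptr)++;
    return true;
}

/* Rewrites buf in place and returns true if a greater string was made. */
static bool make_greater(char *buf, character_incrementer charinc)
{
    int len = (int) strlen(buf);

    while (len > 0)
    {
        int charlen = 1;        /* pg_mbcliplen() would compute this */
        unsigned char *lastchar = (unsigned char *) (buf + len - charlen);

        if (charinc(lastchar, charlen))
            return true;        /* buf now sorts after the original */
        len -= charlen;         /* no luck: drop the last character */
        buf[len] = '\0';
    }
    return false;
}

int main(void)
{
    char s1[] = "abc";          /* -> "abd" */
    char s2[] = "ab~";          /* last char saturated -> "ac" */

    printf("%s\n", make_greater(s1, ascii_charinc) ? s1 : "(failed)");
    printf("%s\n", make_greater(s2, ascii_charinc) ? s2 : "(failed)");
    return 0;
}

Running it prints `abd' and `ac': the loop climbs toward a greater string and shortens the prefix only when the last character saturates.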
This is a rebased version of the patch `Allow encoding specific character incrementer' (https://commitfest.postgresql.org/action/patch_view?id=602). In addition to the patch, an increment sanity check program for the new functions pg_generic_charinc and pg_utf8_increment is attached.

--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 5d999e6..b7f1922 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -5652,6 +5652,18 @@ pattern_selectivity(Const *patt, Pattern_Type ptype)
 
 /*
+ * This function is the "character increment" function for bytea, used in
+ * make_greater_string(); it has the same interface as pg_wchar_tbl.charinc.
+ */
+static bool byte_increment(unsigned char *ptr, int len)
+{
+	if (*ptr >= 255) return false;
+
+	(*ptr)++;
+	return true;
+}
+
+/*
  * Try to generate a string greater than the given string or any
  * string it is a prefix of.  If successful, return a palloc'd string
  * in the form of a Const node; else return NULL.
@@ -5690,6 +5702,7 @@ make_greater_string(const Const *str_const, FmgrInfo *ltproc, Oid collation)
 	int			len;
 	Datum		cmpstr;
 	text	   *cmptxt = NULL;
+	character_incrementer charincfunc;
 
 	/*
 	 * Get a modifiable copy of the prefix string in C-string format, and set
@@ -5751,27 +5764,38 @@ make_greater_string(const Const *str_const, FmgrInfo *ltproc, Oid collation)
 		}
 	}
 
+	if (datatype != BYTEAOID)
+		charincfunc = pg_database_encoding_character_incrementer();
+	else
+		charincfunc = &byte_increment;
+
 	while (len > 0)
 	{
-		unsigned char *lastchar = (unsigned char *) (workstr + len - 1);
-		unsigned char savelastchar = *lastchar;
+		int			charlen;
+		unsigned char *lastchar;
+		unsigned char savelastbyte;
+		Const	   *workstr_const;
+
+		if (datatype == BYTEAOID)
+			charlen = 1;
+		else
+			charlen = len - pg_mbcliplen(workstr, len, len - 1);
+
+		lastchar = (unsigned char *) (workstr + len - charlen);
 
 		/*
-		 * Try to generate a larger string by incrementing the last byte.
+		 * savelastbyte has meaning only for datatype == BYTEAOID
 		 */
-		while (*lastchar < (unsigned char) 255)
-		{
-			Const	   *workstr_const;
+		savelastbyte = *lastchar;
 
-			(*lastchar)++;
+		/*
+		 * Try to generate a larger string by incrementing the last byte or
+		 * character.
+		 */
+		if (charincfunc(lastchar, charlen))
+		{
 			if (datatype != BYTEAOID)
-			{
-				/* do not generate invalid encoding sequences */
-				if (!pg_verifymbstr(workstr, len, true))
-					continue;
 				workstr_const = string_to_const(workstr, datatype);
-			}
 			else
 				workstr_const = string_to_bytea_const(workstr, len);
 
@@ -5786,26 +5810,17 @@ make_greater_string(const Const *str_const, FmgrInfo *ltproc, Oid collation)
 				pfree(workstr);
 				return workstr_const;
 			}
-
+
 			/* No good, release unusable value and try again */
 			pfree(DatumGetPointer(workstr_const->constvalue));
 			pfree(workstr_const);
 		}
 
-		/* restore last byte so we don't confuse pg_mbcliplen */
-		*lastchar = savelastchar;
-
 		/*
-		 * Truncate off the last character, which might be more than 1 byte,
-		 * depending on the character encoding.
+		 * Truncate off the last character, or restore the last byte for BYTEA.
 		 */
-		if (datatype != BYTEAOID && pg_database_encoding_max_length() > 1)
-			len = pg_mbcliplen(workstr, len, len - 1);
-		else
-			len -= 1;
-
-		if (datatype != BYTEAOID)
-			workstr[len] = '\0';
+		len -= charlen;
+		workstr[len] = (datatype != BYTEAOID ? '\0' : savelastbyte);
 	}
 
 	/* Failed...
*/ diff --git a/src/backend/utils/mb/wchar.c b/src/backend/utils/mb/wchar.c index f23732f..00b3e2a 100644 --- a/src/backend/utils/mb/wchar.c +++ b/src/backend/utils/mb/wchar.c @@ -1,3 +1,4 @@ +/* * conversion functions between pg_wchar and multibyte streams. * Tatsuo Ishii @@ -1336,53 +1337,254 @@ pg_utf8_islegal(const unsigned char *source, int length)/* *------------------------------------------------------------------- + * character incrementer + * + * These functions accept "charptr", a pointer to the first byte of a + * maybe-multibyte character. Try `increment' the character and return true if + * successed. If these functions returns false, the character should be + * untouched. These functions must be implemented in correspondence with + * verifiers, in other words, the rewrited character by this function must pass + * the check by pg_*_verifier() if returns true. Returning the return value of + * pg_*_verifier() corresponding can finnaly avoid such a inconsistency when + * something wrong. + * ------------------------------------------------------------------- + */ + +#ifndef FRONTEND +static bool pg_generic_charinc(unsigned char *charptr, int len) +{ + unsigned char *lastchar = (unsigned char *) (charptr + len - 1); + unsigned char savelastchar = *lastchar; + const char *const_charptr = (const char *)charptr; + + while (*lastchar < (unsigned char) 255) + { + (*lastchar)++; + if (!pg_verifymbstr(const_charptr, len, true)) + continue; + return true; + } + + *lastchar = savelastchar; + return false; +} + +static bool pg_utf8_increment(unsigned char *charptr, int length) +{ + unsigned char a; + unsigned char bak[4]; + bool success; + + memcpy(bak, charptr, length); + switch (length) + { + default: + /* reject lengths 5 and 6 for now */ + return false; + case 4: + a = charptr[3]; + if (a < 0xBF) + { + charptr[3]++; + break; + } + charptr[3] = 0x80; + /* FALL THRU */ + case 3: + a = charptr[2]; + if (a < 0xBF) + { + charptr[2]++; + break; + } + charptr[2] = 0x80; + /* FALL THRU */ + case 2: + a = charptr[1]; + if ((*charptr == 0xed && a < 0x9F) || a < 0xBF) + { + charptr[1]++; + break; + } + charptr[1] = 0x80; + /* FALL THRU */ + case 1: + a = *charptr; + if (a == 0x7F || a == 0xDF || a == 0xEF || a == 0xF7) { + memcpy(charptr, bak, length); + return false; + } + charptr[0]++; + break; + } + + /* Check the result with pg_utf8_islegal as the last resort. */ + success = pg_utf8_islegal(charptr, length); + if (!success) + memcpy(charptr, bak, length); + + return success; +} + +static bool pg_eucjp_increment(unsigned char *charptr, int length) { + unsigned char bak[3]; + bool success; + unsigned char c1, c2; + signed int i; + + memcpy(bak, charptr, length); + + c1 = *charptr; + + switch (c1) + { + case SS2: /* JIS X 0201 */ + if (length != 2) return false; + + c2 = charptr[1]; + + if (c2 > 0xde) + charptr[0] = charptr[1] = 0xa1; + else if (c2 < 0xa1) + charptr[1] = 0xa1; + else + charptr[1]++; + + break; + + case SS3: /* JIS X 0212 */ + if (length != 3) return false; + + for (i = 2 ; i > 1 ; i--) + { + c2 = charptr[i]; + if (c2 < 0xa1) + { + charptr[i] = 0xa1; + return true; + } + else if (c2 < 0xfe) + { + charptr[i]++; + break; + } + charptr[i] = 0xa1; + } + + + if (i == 0) /* Out of code region */ + { + memcpy(charptr, bak, length); + return false; + } + + break; + + default: + if (IS_HIGHBIT_SET(c1)) /* JIS X 0208? 
*/ + { + if (length != 2) return false; + + for (i = 1 ; i >= 0 ; i--) /* i must be signed */ + { + c2 = charptr[i]; + if (c2 < 0xa1) + { + charptr[i] = 0xa1; + return true; + } + else if (c2 < 0xfe) + { + charptr[i]++; + break; + } + charptr[i] = 0xa1; + } + + if (i < 0) /* Out of 2 byte code region */ + { + memcpy(charptr, bak, length); + return false; + } + } + else + { /* ASCII */ + if (c1 > 0x7e) + return false; + (*charptr)++; + } + } + + + /* Check the result with pg_eucjp_verifier as the last resort. */ + success = (pg_eucjp_verifier(charptr, length) == length); + if (!success) + memcpy(charptr, bak, length); + + return success; +} +#else +/* + * Character increment functions are not available on frontend. Abort on call + * to prevent miseuse. + */ +static bool pg_generic_charinc(unsigned char *charptr, int len) { + fputs(_("Character incrementer cannot be used in frontend.\n"), stderr); + abort(); +} +#define pg_utf8_increment pg_generic_charinc +#define pg_eucjp_increment pg_generic_charinc +#endif + +/* + *------------------------------------------------------------------- * encoding info table * XXX must be sorted by thesame order as enum pg_enc (in mb/pg_wchar.h) *------------------------------------------------------------------- */pg_wchar_tblpg_wchar_table[] = { - {pg_ascii2wchar_with_len, pg_ascii_mblen, pg_ascii_dsplen, pg_ascii_verifier, 1}, /* PG_SQL_ASCII */ - {pg_eucjp2wchar_with_len, pg_eucjp_mblen, pg_eucjp_dsplen, pg_eucjp_verifier, 3}, /* PG_EUC_JP */ - {pg_euccn2wchar_with_len, pg_euccn_mblen, pg_euccn_dsplen, pg_euccn_verifier, 2}, /* PG_EUC_CN */ - {pg_euckr2wchar_with_len, pg_euckr_mblen, pg_euckr_dsplen, pg_euckr_verifier, 3}, /* PG_EUC_KR */ - {pg_euctw2wchar_with_len, pg_euctw_mblen, pg_euctw_dsplen, pg_euctw_verifier, 4}, /* PG_EUC_TW */ - {pg_eucjp2wchar_with_len, pg_eucjp_mblen, pg_eucjp_dsplen, pg_eucjp_verifier, 3}, /* PG_EUC_JIS_2004 */ - {pg_utf2wchar_with_len, pg_utf_mblen, pg_utf_dsplen, pg_utf8_verifier, 4}, /* PG_UTF8 */ - {pg_mule2wchar_with_len, pg_mule_mblen, pg_mule_dsplen, pg_mule_verifier, 4}, /* PG_MULE_INTERNAL */ - {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN1 */ - {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN2 */ - {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN3 */ - {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN4 */ - {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN5 */ - {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN6 */ - {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN7 */ - {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN8 */ - {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN9 */ - {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN10 */ - {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1256 */ - {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1258 */ - {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN866 */ - {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* 
PG_WIN874 */ - {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_KOI8R */ - {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1251 */ - {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1252 */ - {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* ISO-8859-5 */ - {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* ISO-8859-6 */ - {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* ISO-8859-7 */ - {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* ISO-8859-8 */ - {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1250 */ - {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1253 */ - {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1254 */ - {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1255 */ - {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1257 */ - {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_KOI8U */ - {0, pg_sjis_mblen, pg_sjis_dsplen, pg_sjis_verifier, 2}, /* PG_SJIS */ - {0, pg_big5_mblen, pg_big5_dsplen, pg_big5_verifier, 2}, /* PG_BIG5 */ - {0, pg_gbk_mblen, pg_gbk_dsplen, pg_gbk_verifier, 2}, /* PG_GBK */ - {0, pg_uhc_mblen, pg_uhc_dsplen, pg_uhc_verifier, 2}, /* PG_UHC */ - {0, pg_gb18030_mblen, pg_gb18030_dsplen, pg_gb18030_verifier, 4}, /* PG_GB18030 */ - {0, pg_johab_mblen, pg_johab_dsplen, pg_johab_verifier, 3}, /* PG_JOHAB */ - {0, pg_sjis_mblen, pg_sjis_dsplen, pg_sjis_verifier, 2} /* PG_SHIFT_JIS_2004 */ + {pg_ascii2wchar_with_len, pg_ascii_mblen, pg_ascii_dsplen, pg_generic_charinc, pg_ascii_verifier, 1}, /* PG_SQL_ASCII*/ + {pg_eucjp2wchar_with_len, pg_eucjp_mblen, pg_eucjp_dsplen, pg_eucjp_increment, pg_eucjp_verifier, 3}, /* PG_EUC_JP*/ + {pg_euccn2wchar_with_len, pg_euccn_mblen, pg_euccn_dsplen, pg_generic_charinc, pg_euccn_verifier, 2}, /* PG_EUC_CN*/ + {pg_euckr2wchar_with_len, pg_euckr_mblen, pg_euckr_dsplen, pg_generic_charinc, pg_euckr_verifier, 3}, /* PG_EUC_KR*/ + {pg_euctw2wchar_with_len, pg_euctw_mblen, pg_euctw_dsplen, pg_generic_charinc, pg_euctw_verifier, 4}, /* PG_EUC_TW*/ + {pg_eucjp2wchar_with_len, pg_eucjp_mblen, pg_eucjp_dsplen, pg_eucjp_increment, pg_eucjp_verifier, 3}, /* PG_EUC_JIS_2004*/ + {pg_utf2wchar_with_len, pg_utf_mblen, pg_utf_dsplen, pg_utf8_increment, pg_utf8_verifier, 4}, /* PG_UTF8 */ + {pg_mule2wchar_with_len, pg_mule_mblen, pg_mule_dsplen, pg_generic_charinc, pg_mule_verifier, 4}, /* PG_MULE_INTERNAL*/ + {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1}, /*PG_LATIN1 */ + {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1}, /*PG_LATIN2 */ + {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1}, /*PG_LATIN3 */ + {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1}, /*PG_LATIN4 */ + {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1}, /*PG_LATIN5 */ + {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, 
pg_generic_charinc, pg_latin1_verifier, 1}, /*PG_LATIN6 */ + {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1}, /*PG_LATIN7 */ + {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1}, /*PG_LATIN8 */ + {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1}, /*PG_LATIN9 */ + {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1}, /*PG_LATIN10 */ + {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1}, /*PG_WIN1256 */ + {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1}, /*PG_WIN1258 */ + {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1}, /*PG_WIN866 */ + {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1}, /*PG_WIN874 */ + {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1}, /*PG_KOI8R */ + {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1}, /*PG_WIN1251 */ + {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1}, /*PG_WIN1252 */ + {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1}, /*ISO-8859-5 */ + {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1}, /*ISO-8859-6 */ + {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1}, /*ISO-8859-7 */ + {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1}, /*ISO-8859-8 */ + {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1}, /*PG_WIN1250 */ + {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1}, /*PG_WIN1253 */ + {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1}, /*PG_WIN1254 */ + {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1}, /*PG_WIN1255 */ + {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1}, /*PG_WIN1257 */ + {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_generic_charinc, pg_latin1_verifier, 1}, /*PG_KOI8U */ + {0, pg_sjis_mblen, pg_sjis_dsplen, pg_generic_charinc, pg_sjis_verifier, 2}, /* PG_SJIS */ + {0, pg_big5_mblen, pg_big5_dsplen, pg_generic_charinc, pg_big5_verifier, 2}, /* PG_BIG5 */ + {0, pg_gbk_mblen, pg_gbk_dsplen, pg_generic_charinc, pg_gbk_verifier, 2}, /* PG_GBK */ + {0, pg_uhc_mblen, pg_uhc_dsplen, pg_generic_charinc, pg_uhc_verifier, 2}, /* PG_UHC */ + {0, pg_gb18030_mblen, pg_gb18030_dsplen, pg_generic_charinc, pg_gb18030_verifier, 4}, /* PG_GB18030 */ + {0, pg_johab_mblen, pg_johab_dsplen, pg_generic_charinc, pg_johab_verifier, 3}, /* PG_JOHAB */ + {0, pg_sjis_mblen, pg_sjis_dsplen, pg_generic_charinc, pg_sjis_verifier, 2} /* PG_SHIFT_JIS_2004 */};/* returnsthe byte length of a word for mule internal code */ @@ -1459,6 +1661,15 @@ pg_database_encoding_max_length(void)}/* + * give the character incrementer for the encoding for the current database + */ +character_incrementer 
+pg_database_encoding_character_incrementer(void)
+{
+	return pg_wchar_table[GetDatabaseEncoding()].charinc;
+}
+
+/*
  * Verify mbstr to make sure that it is validly encoded in the current
  * database encoding.  Otherwise same as pg_verify_mbstr().
  */

diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index 826c7af..356703a 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -284,6 +284,8 @@ typedef int (*mblen_converter) (const unsigned char *mbstr);
 
 typedef int (*mbdisplaylen_converter) (const unsigned char *mbstr);
 
+typedef bool (*character_incrementer) (unsigned char *mbstr, int len);
+
 typedef int (*mbverifier) (const unsigned char *mbstr, int len);
 
 typedef struct
@@ -292,6 +294,7 @@ typedef struct
 										 * string to a wchar */
 	mblen_converter mblen;		/* get byte length of a char */
 	mbdisplaylen_converter dsplen;	/* get display width of a char */
+	character_incrementer charinc;	/* character code incrementer, if not null */
 	mbverifier	mbverify;		/* verify multibyte sequence */
 	int			maxmblen;		/* max bytes for a char in this encoding */
 } pg_wchar_tbl;
@@ -389,6 +392,7 @@ extern int	pg_encoding_mbcliplen(int encoding, const char *mbstr,
 extern int	pg_mbcharcliplen(const char *mbstr, int len, int limit);
 extern int	pg_encoding_max_length(int encoding);
 extern int	pg_database_encoding_max_length(void);
+extern character_incrementer pg_database_encoding_character_incrementer(void);
 extern int	PrepareClientEncoding(int encoding);
 extern int	SetClientEncoding(int encoding);

// Sanity test for the UTF-8 specific character incrementer.
//
// -v also displays status lines for invalid source code points.
// -m also displays status lines for results where the product of the new
//    incrementer matches that of the generic incrementer.
// Specifying both -v and -m shows all status lines.
//
// `utftest | grep FAILED' shows the glitches remaining with the new
// incrementer (4 lines).
//
// CAUTION: this program emits a very large number of lines.
//
// `utftest' emits 17375 lines: the code points saved by the new function
// plus the remaining glitches.
// `utftest -m' emits 1112064 lines.
//
// Sample status lines:
//  src char   src utf8   dst utf8   dst char  result status
//  000d7bf => ed9ebf  => ed9f80  (000d7c0) succeeded - doesn't match generic inc(000d7bf)
//  000d7ff => ed9fbf  => ed9fbf  (000d7ff) FAILED - matches generic inc
//  000d800 => eda080  Source not valid utf8
//
// succeeded/FAILED in the result status is the return value of the
// character increment function.  The text after the dash says whether the
// result of the new incrementer was identical to that of the generic
// incrementer.
#include <stdio.h>
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>

typedef int bool;
static int true = 1;
static int false = 0;

bool pg_utf8_islegal(const unsigned char *source, int length);
int pg_utf_mblen(const unsigned char *s);
static bool pg_utf8_increment(unsigned char *mbstr, int length);
static bool pg_generic_charinc(unsigned char *charptr, int len);
void uni2utf8(unsigned int unicode, unsigned char *utf8buf);
unsigned int utf8tounicode(unsigned char *utf8buf);
int scatf(char *buf, char *format, ...);

int
main(int argc, char **argv)
{
	unsigned char buf[4], buf2[4];
	char outbuf[1024];
	unsigned int i;
	int dispinvalid = 0;
	int dispmatch = 0;

	for (i = 1; i < argc; i++)
	{
		if (strcmp(argv[i], "-v") == 0)
			dispinvalid = 1;
		if (strcmp(argv[i], "-m") == 0)
			dispmatch = 1;
	}

	for (i = 0; i < 0x110000; i++)	/* all Unicode code points */
	{
		bool prechk, successed, gensuccess, match;
		int len, j;

		uni2utf8(i, buf);
		uni2utf8(i, buf2);
		*outbuf = 0;
		scatf(outbuf, "%07x => ", i);

		len = pg_utf_mblen(buf);
		j = 0;
		while (j < len)
			scatf(outbuf, "%02x", buf[j++]);
		while (j < 4)
		{
			scatf(outbuf, " ");
			j++;
		}

		prechk = pg_utf8_islegal(buf, len);
		if (!prechk)
		{
			scatf(outbuf, "Source not valid utf8");
			if (dispinvalid)
				puts(outbuf);
			continue;
		}

		successed = pg_utf8_increment(buf, len);
		scatf(outbuf, " => ");
		j = 0;
		while (j < len)
			scatf(outbuf, "%02x", buf[j++]);
		while (j < 4)
		{
			scatf(outbuf, " ");
			j++;
		}

		gensuccess = pg_generic_charinc(buf2, len);
		match = (memcmp(buf, buf2, len) == 0);

		if (!gensuccess || !match || dispmatch)
		{
			scatf(outbuf, "(%07x) %s - %s",
				  utf8tounicode(buf),
				  (successed ? "successed" : "FAILED"),
				  (match ? "Match to generic inc" : "Don't match to generic inc"));
			if (!match)
				scatf(outbuf, "(%07x)", utf8tounicode(buf2));
			puts(outbuf);
		}
	}
	return 0;
}

bool
pg_utf8_islegal(const unsigned char *source, int length)
{
	unsigned char a;

	switch (length)
	{
		default:
			/* reject lengths 5 and 6 for now */
			return false;
		case 4:
			a = source[3];
			if (a < 0x80 || a > 0xBF)
				return false;
			/* FALL THRU */
		case 3:
			a = source[2];
			if (a < 0x80 || a > 0xBF)
				return false;
			/* FALL THRU */
		case 2:
			a = source[1];
			switch (*source)
			{
				case 0xE0:
					if (a < 0xA0 || a > 0xBF)
						return false;
					break;
				case 0xED:
					if (a < 0x80 || a > 0x9F)
						return false;
					break;
				case 0xF0:
					if (a < 0x90 || a > 0xBF)
						return false;
					break;
				case 0xF4:
					if (a < 0x80 || a > 0x8F)
						return false;
					break;
				default:
					if (a < 0x80 || a > 0xBF)
						return false;
					break;
			}
			/* FALL THRU */
		case 1:
			a = *source;
			if (a >= 0x80 && a < 0xC2)
				return false;
			if (a > 0xF4)
				return false;
			break;
	}
	return true;
}

int
pg_utf_mblen(const unsigned char *s)
{
	int len;

	if ((*s & 0x80) == 0)
		len = 1;
	else if ((*s & 0xe0) == 0xc0)
		len = 2;
	else if ((*s & 0xf0) == 0xe0)
		len = 3;
	else if ((*s & 0xf8) == 0xf0)
		len = 4;
#ifdef NOT_USED
	else if ((*s & 0xfc) == 0xf8)
		len = 5;
	else if ((*s & 0xfe) == 0xfc)
		len = 6;
#endif
	else
		len = 1;
	return len;
}

static bool
pg_utf8_increment(unsigned char *charptr, int length)
{
	unsigned char a;
	unsigned char bak[4];
	bool success;

	memcpy(bak, charptr, length);
	switch (length)
	{
		default:
			/* reject lengths 5 and 6 for now */
			return false;
		case 4:
			a = charptr[3];
			if (a < 0xBF)
			{
				charptr[3]++;
				break;
			}
			charptr[3] = 0x80;
			/* FALL THRU */
		case 3:
			a = charptr[2];
			if (a < 0xBF)
			{
				charptr[2]++;
				break;
			}
			charptr[2] = 0x80;
			/* FALL THRU */
		case 2:
			a = charptr[1];
			if ((*charptr == 0xed && a < 0x9F) || a < 0xBF)
			{
				charptr[1]++;
				break;
			}
			charptr[1] = 0x80;
			/* FALL THRU */
		case 1:
			a = *charptr;
			if (a == 0x7F || a == 0xDF || a == 0xEF || a == 0xF7)
			{
				memcpy(charptr, bak, length);
				return false;
			}
			charptr[0]++;
			break;
	}

	/* Check the result with pg_utf8_islegal as the last resort.
	 */
	success = pg_utf8_islegal(charptr, length);
	if (!success)
		memcpy(charptr, bak, length);
	return success;
}

void
uni2utf8(unsigned int unicode, unsigned char *utf8buf)
{
	int i, len;

	if (unicode < 0x80)
	{
		len = 1;
		*utf8buf = 0;
	}
	else if (unicode < 0x800)
	{
		len = 2;
		*utf8buf = 0xc0;
	}
	else if (unicode < 0x10000)
	{
		len = 3;
		*utf8buf = 0xe0;
	}
	else if (unicode < 0x110000)
	{
		len = 4;
		*utf8buf = 0xf0;
	}
	else
	{
		printf("Unicode out of range: %x\n", unicode);
		exit(1);
	}

	for (i = len - 1; i > 0; i--)
	{
		utf8buf[i] = (0x80 | (unicode & 0x3f));
		unicode >>= 6;
	}
	*utf8buf |= unicode;
}

unsigned int
utf8tounicode(unsigned char *utf8buf)
{
	unsigned int a = *utf8buf;

	if (a < 0x80)
		return a;
	if (a < 0xc0)
		return 0xfffffff;
	if (a < 0xe0)
		return ((utf8buf[0] - 0xc0) << 6) + (utf8buf[1] - 0x80);
	if (a < 0xf0)
		return ((utf8buf[0] - 0xe0) << 12) + ((utf8buf[1] - 0x80) << 6) +
			utf8buf[2] - 0x80;
	if (a < 0xf8)
		return ((utf8buf[0] - 0xf0) << 18) + ((utf8buf[1] - 0x80) << 12) +
			((utf8buf[2] - 0x80) << 6) + utf8buf[3] - 0x80;
	return 0xfffffff;
}

static bool
pg_generic_charinc(unsigned char *charptr, int len)
{
	unsigned char *lastchar = (unsigned char *) (charptr + len - 1);
	unsigned char savelastchar = *lastchar;
	const unsigned char *const_charptr = (const unsigned char *) charptr;

	while (*lastchar < (unsigned char) 255)
	{
		(*lastchar)++;
		if (!pg_utf8_islegal(const_charptr, len))	/* modified */
			continue;
		return true;
	}

	*lastchar = savelastchar;
	return false;
}

int
scatf(char *buf, char *format, ...)
{
	va_list args;
	int ret;

	va_start(args, format);
	ret = vsprintf(buf + strlen(buf), format, args);
	va_end(args);
	return ret;
}

// sanity test for the EUC-JP specific character incrementer.
//
// -v displays status lines for invalid source charcodes.
// -m displays status lines where the result of the new incrementer
//    matches that of the generic incrementer.
// All status lines are shown when both -v and -m are specified.
//
// `euctest | grep FAILED' shows the glitches remaining with the new
// incrementer. (2 lines)
//
// CAUTION:
// `euctest' emits 190 lines.
// `euctest -m' emits 17863 lines.
// `euctest -m -v' emits 16843008 lines.
//
// Sample of output lines:
// src  => dest result - status
// 7e   => 7f   successed - Match to generic inc
// 7f   => 7f   FAILED - Match to generic inc
// 8edf => a1a1 successed - Don't match to generic inc(8edf)
//
// "successed"/"FAILED" in the result status shows the return value of the
// character increment function; the rest of the line says whether the
// result of the new incrementer was identical to that of the generic
// incrementer.
#include <stdio.h>
#include <stdarg.h>
#include <string.h>

#define SS2 0x8e				/* single shift 2 (JIS0201) */
#define SS3 0x8f				/* single shift 3 (JIS0212) */

#define HIGHBIT			(0x80)
#define IS_HIGHBIT_SET(ch)	((unsigned char) (ch) & HIGHBIT)

typedef int bool;
static int false = 0;
static int true = 1;

static bool pg_generic_charinc(unsigned char *charptr, int len);
static bool pg_eucjp_increment(unsigned char *charptr, int length);
static int pg_eucjp_verifier(const unsigned char *s, int len);
void do_check(int len, unsigned char *buf, int dispinvalid, int dispmatch);
int scatf(char *buf, char *format, ...);

int
main(int argc, char **argv)
{
	unsigned int i, j, k;
	unsigned char buf[3];
	int dispinvalid = 0;
	int dispmatch = 0;

	for (i = 1; i < argc; i++)
	{
		if (strcmp(argv[i], "-v") == 0)
			dispinvalid = 1;
		if (strcmp(argv[i], "-m") == 0)
			dispmatch = 1;
	}

	/* single byte characters */
	for (i = 0; i < 256; i++)
	{
		*buf = i;
		do_check(1, buf, dispinvalid, dispmatch);
	}

	/* 2 byte characters */
	for (i = 0; i < 256; i++)
	{
		for (j = 0; j < 256; j++)
		{
			*buf = i;
			buf[1] = j;
			do_check(2, buf, dispinvalid, dispmatch);
		}
	}

	/* 3 byte characters */
	for (i = 0; i < 256; i++)
	{
		for (j = 0; j < 256; j++)
		{
			for (k = 0; k < 256; k++)
			{
				*buf = i;
				buf[1] = j;
				buf[2] = k;
				do_check(3, buf, dispinvalid, dispmatch);
			}
		}
	}
	return 0;
}

void
do_check(int len, unsigned char *buf, int dispinvalid, int dispmatch)
{
	unsigned char buf2[3];
	char outbuf[1024];
	int i, src_is_valid, successed, gensuccessed, match;

	*outbuf = 0;
	src_is_valid = (pg_eucjp_verifier(buf, len) == len);
	if (!src_is_valid)
	{
		if (dispinvalid)
		{
			for (i = 0; i < len; i++)
				scatf(outbuf, "%02x", buf[i]);
			strcat(outbuf, " - Src char is invalid.");
			puts(outbuf);
		}
		return;
	}

	memcpy(buf2, buf, len);

	for (i = 0; i < len; i++)
		scatf(outbuf, "%02x", ((int) buf[i] & 0xff));
	strcat(outbuf, " => ");

	successed = pg_eucjp_increment(buf, len);
	gensuccessed = pg_generic_charinc(buf2, len);
	match = (memcmp(buf, buf2, len) == 0);

	if (!gensuccessed || !match || dispmatch)
	{
		for (i = 0; i < len; i++)
			scatf(outbuf, "%02x", ((int) buf[i] & 0xff));
		scatf(outbuf, " %s - %s",
			  (successed ? "successed" : "FAILED"),
			  (match ? "Match to generic inc" : "Don't match to generic inc"));
		if (!match)
		{
			strcat(outbuf, "(");
			for (i = 0; i < len; i++)
				scatf(outbuf, "%02x", ((int) buf2[i] & 0xff));
			strcat(outbuf, ")");
		}
		puts(outbuf);
	}
}

static bool
pg_eucjp_increment(unsigned char *charptr, int length)
{
	unsigned char bak[3];
	bool success;
	unsigned char c1, c2;
	signed int i;

	memcpy(bak, charptr, length);

	c1 = *charptr;
	switch (c1)
	{
		case SS2:		/* JIS X 0201 */
			if (length != 2)
				return false;

			c2 = charptr[1];
			if (c2 > 0xde)
				charptr[0] = charptr[1] = 0xa1;
			else if (c2 < 0xa1)
				charptr[1] = 0xa1;
			else
				charptr[1]++;
			break;

		case SS3:		/* JIS X 0212 */
			if (length != 3)
				return false;

			for (i = 2; i > 1; i--)
			{
				c2 = charptr[i];
				if (c2 < 0xa1)
				{
					charptr[i] = 0xa1;
					return true;
				}
				else if (c2 < 0xfe)
				{
					charptr[i]++;
					break;
				}
				charptr[i] = 0xa1;
			}
			if (i == 0)		/* out of code region */
			{
				memcpy(charptr, bak, length);
				return false;
			}
			break;

		default:
			if (IS_HIGHBIT_SET(c1))		/* JIS X 0208? */
			{
				if (length != 2)
					return false;

				for (i = 1; i >= 0; i--)	/* i must be signed */
				{
					c2 = charptr[i];
					if (c2 < 0xa1)
					{
						charptr[i] = 0xa1;
						return true;
					}
					else if (c2 < 0xfe)
					{
						charptr[i]++;
						break;
					}
					charptr[i] = 0xa1;
				}
				if (i < 0)	/* out of 2 byte code region */
				{
					memcpy(charptr, bak, length);
					return false;
				}
			}
			else
			{
				/* ASCII */
				if (c1 > 0x7e)
					return false;
				(*charptr)++;
			}
	}

	/* Check the result with pg_eucjp_verifier as the last resort.
	 */
	success = (pg_eucjp_verifier(charptr, length) == length);
	if (!success)
		memcpy(charptr, bak, length);
	return success;
}

#define IS_EUC_RANGE_VALID(c)	((c) >= 0xa1 && (c) <= 0xfe)

static int
pg_eucjp_verifier(const unsigned char *s, int len)
{
	int l;
	unsigned char c1, c2;

	c1 = *s++;
	switch (c1)
	{
		case SS2:		/* JIS X 0201 */
			l = 2;
			if (l > len)
				return -1;
			c2 = *s++;
			if (c2 < 0xa1 || c2 > 0xdf)
				return -1;
			break;

		case SS3:		/* JIS X 0212 */
			l = 3;
			if (l > len)
				return -1;
			c2 = *s++;
			if (!IS_EUC_RANGE_VALID(c2))
				return -1;
			c2 = *s++;
			if (!IS_EUC_RANGE_VALID(c2))
				return -1;
			break;

		default:
			if (IS_HIGHBIT_SET(c1))		/* JIS X 0208? */
			{
				l = 2;
				if (l > len)
					return -1;
				if (!IS_EUC_RANGE_VALID(c1))
					return -1;
				c2 = *s++;
				if (!IS_EUC_RANGE_VALID(c2))
					return -1;
			}
			else
			{
				/* must be ASCII */
				l = 1;
			}
			break;
	}
	return l;
}

static bool
pg_generic_charinc(unsigned char *charptr, int len)
{
	unsigned char *lastchar = (unsigned char *) (charptr + len - 1);
	unsigned char savelastchar = *lastchar;
	const unsigned char *const_charptr = (const unsigned char *) charptr;

	while (*lastchar < (unsigned char) 255)
	{
		(*lastchar)++;
		if (pg_eucjp_verifier(const_charptr, len) != len)	/* modified */
			continue;
		return true;
	}

	*lastchar = savelastchar;
	return false;
}

int
scatf(char *buf, char *format, ...)
{
	va_list args;
	int ret;

	va_start(args, format);
	ret = vsprintf(buf + strlen(buf), format, args);
	va_end(args);
	return ret;
}
On Tue, Sep 13, 2011 at 10:13 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@oss.ntt.co.jp> wrote:
> This is a rebased patch of `Allow encoding specific character
> incrementer' (https://commitfest.postgresql.org/action/patch_view?id=602).
>
> In addition to the patch, an increment sanity check program for the new
> functions pg_generic_charinc and pg_utf8_increment is attached.

I took a look at this patch ... and the thread ... and the previous
thread with the same subject line:

http://archives.postgresql.org/pgsql-bugs/2010-06/msg00303.php

As I understand it, the issue here is that when we try to generate
suitable > and < quals for a LIKE expression, we need to find a string
which is greater than the prefix we're searching for, and the existing
algorithm sometimes fails.  But there seems to be some dispute over how
likely this is to occur.  Tom argues that the case is so rare that we
shouldn't worry about it:

http://archives.postgresql.org/pgsql-bugs/2010-06/msg00336.php

...while Kyotaro Horiguchi clearly feels otherwise, citing the
statistic that about 100 out of 7000 Japanese characters fail to work
properly:

http://archives.postgresql.org/pgsql-bugs/2011-07/msg00064.php

That statistic seems to justify some action, but what?  Ideas:

1. Adopt the patch as proposed, or something like it.

2. Instead of installing encoding-specific character incrementing
functions, we could try to come up with a more reliable generic
algorithm.  Not sure exactly what, though.

3. Come up with some way to avoid needing to do this in the first
place.  One random idea I have is - instead of generating > and <
clauses, could we define a "prefix match" operator - i.e. a ### b iff
substr(a, 1, length(b)) = b?  We'd need to do something about the
selectivity, but I don't see why that would be a problem.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > As I understand it, the issue here is that when we try to generate > suitable > and < quals for a LIKE expression, we need to find a string > which is greater than the prefix we're searching for, and the existing > algorithm sometimes fails. But there seems to be some dispute over > how likely this is to occur. Tom argues that the case is so rare that > we shouldn't worry about it: > http://archives.postgresql.org/pgsql-bugs/2010-06/msg00336.php Well, I didn't like the amount of overhead that was implied by that proposal. I don't have a great deal of difficulty with the idea of encoding-specific character incrementers, especially since they could presumably be more efficient than the current brute force approach wherein we call pg_verifymbstr on the entire string after mangling just the last byte. The main risk that I foresee with the proposed approach is that if you have, say, a 4-byte final character, you could iterate through a *whole lot* (millions) of larger encoded characters, with absolutely no hope of making a greater string that way when the determining difference occurs at some earlier character. And then when you do run out, you could waste just as much time at the immediately preceding character, etc etc. The existing algorithm is a compromise between thoroughness of investigation of different values of the last character versus speed of falling back to change the preceding character instead. I'd be the first to say that incrementing only the last byte is a very quick-and-dirty heuristic for making that happen. But I think it would be unwise to allow the thing to consider more than a few hundred values for a character position before giving up. Possibly the encoding-specific incrementer could be designed to run through all legal values of the last byte, then start skipping larger and larger ranges --- maybe just move directly to incrementing the first byte. Aside from that issue, the submitted patch seems to need quite a lot of detail work; for instance, I noted an unconstrained memcpy into a 4-byte local buffer, as well as lots and lots of violations of PG house style. That's certainly all fixable but somebody will have to go through it. > One random idea I have is - instead of generating > and < clauses, > could we define a "prefix match" operator - i.e. a ### b iff substr(a, > 1, length(b)) = b? We'd need to do something about the selectivity, > but I don't see why that would be a problem. The problem is that you'd need to make that a btree-indexable operator. regards, tom lane
On Wed, Sep 21, 2011 at 9:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > The main risk that I foresee with the proposed approach is that if you > have, say, a 4-byte final character, you could iterate through a *whole > lot* (millions) of larger encoded characters, with absolutely no hope of > making a greater string that way when the determining difference occurs > at some earlier character. And then when you do run out, you could > waste just as much time at the immediately preceding character, etc etc. > The existing algorithm is a compromise between thoroughness of > investigation of different values of the last character versus speed of > falling back to change the preceding character instead. I'd be the > first to say that incrementing only the last byte is a very > quick-and-dirty heuristic for making that happen. But I think it would > be unwise to allow the thing to consider more than a few hundred values > for a character position before giving up. Possibly the > encoding-specific incrementer could be designed to run through all legal > values of the last byte, then start skipping larger and larger ranges > --- maybe just move directly to incrementing the first byte. I'm a bit perplexed as to why we can't find a non-stochastic way of doing this. > Aside from that issue, the submitted patch seems to need quite a lot of > detail work; for instance, I noted an unconstrained memcpy into a 4-byte > local buffer, as well as lots and lots of violations of PG house style. > That's certainly all fixable but somebody will have to go through it. I noticed that, too, but figured we should agree on the basic approach first. In the first instance, "someone" will hopefully be the patch author. (Help from anyone else is also welcome, of course... we have almost 40 patches left to crawl through and, at the moment, very few reviewers.) >> One random idea I have is - instead of generating > and < clauses, >> could we define a "prefix match" operator - i.e. a ### b iff substr(a, >> 1, length(b)) = b? We'd need to do something about the selectivity, >> but I don't see why that would be a problem. > > The problem is that you'd need to make that a btree-indexable operator. Well, right. Without that, there's not much point. But do you think that's prohibitively difficult? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> I'm a bit perplexed as to why we can't find a non-stochastic way of doing this.

Because the behavior of common collation algorithms is so wacky as to
approach stochasticity.  In particular, as soon as your string contains
a mix of letter and non-letter characters, "dictionary" algorithms tend
to behave really strangely.  We went around on this quite a few times
before settling on the current approach; we kept thinking exactly what
you're thinking, namely that we ought to be able to predict more than
nothing about what the collation algorithm would consider larger.

An example of the weirdness is that in en_US collation, we have

postgres=# select '/root/' < '/root0';
 ?column?
----------
 t
(1 row)

postgres=# select '/root/t' < '/root0';
 ?column?
----------
 f
(1 row)

It was cases like this that forced us to give up on using non-C-locale
indexes for LIKE optimization, because the derived greater-than and
less-than conditions have to be 100% correct for that.  What we're
still using make_greater_string for in non-C locales is to generate
estimates about the selectivity of LIKE prefix conditions; for that,
some amount of error is tolerable and so we just live with effects like
the above.  (We have to deal with the locale's sort order, whatever the
heck it is, because that's what the pg_statistic histogram is computed
in.)

Now, having said that, I'm starting to wonder again why it's worth our
trouble to fool with encoding-specific incrementers.  The exactness of
the estimates seems unlikely to be improved very much by doing this.

>>> One random idea I have is - instead of generating > and < clauses,
>>> could we define a "prefix match" operator - i.e. a ### b iff substr(a,
>>> 1, length(b)) = b?  We'd need to do something about the selectivity,
>>> but I don't see why that would be a problem.

>> The problem is that you'd need to make that a btree-indexable operator.

> Well, right.  Without that, there's not much point.  But do you think
> that's prohibitively difficult?

The problem is that you'd just be shifting all these same issues into
the btree index machinery, which is not any better equipped to cope with
them, and would not be a good place to be adding overhead.

			regards, tom lane
Re: [v9.2] make_greater_string() does not return a string in some cases
From: Kyotaro HORIGUCHI
Thank you for your understanding on that point.

At Wed, 21 Sep 2011 20:35:02 -0400, Robert Haas <robertmhaas@gmail.com> wrote
> ...while Kyotaro Horiguchi clearly feels otherwise, citing the
> statistic that about 100 out of 7000 Japanese characters fail to work
> properly:
>
> http://archives.postgresql.org/pgsql-bugs/2011-07/msg00064.php
>
> That statistic seems to justify some action, but what? Ideas:

To add to those figures, which are based on all characters defined in
JIS X 0208 - the standard traditionally used for information exchange
in Japan (though it is becoming history now) - if we narrow the set to
the commonly used characters (the `Jouyou Kanji', which Japanese high
school graduates are expected to know), 35 out of 2100 fail.

# On the other hand, widening to JIS X 0213, which is roughly
# compatible with Unicode and defines more than 12K characters, I
# have not counted, but the additional 5K characters can be assumed
# to be less likely to fail than the characters in JIS X 0208.

> 1. Adopt the patch as proposed, or something like it.
> 2. Instead of installing encoding-specific character incrementing
> functions, we could try to come up with a more reliable generic
> algorithm. Not sure exactly what, though.
> 3. Come up with some way to avoid needing to do this in the first place.
>
> One random idea I have is - instead of generating > and < clauses,
> could we define a "prefix match" operator - i.e. a ### b iff substr(a,
> 1, length(b)) = b? We'd need to do something about the selectivity,
> but I don't see why that would be a problem.
>
> Thoughts?

I am a newbie to PostgreSQL, but from a general point of view, I think
the most radical and clean way to fix this behavior is to give indexes
a built-in prefix-matching capability for strings, ignoring whatever
overheads that might involve, of which I know nothing. This would save
all the failure cases this patch leaves unsaved, assuming that the
`greater string' does not need to be a `valid string' merely for
searching the btree.

Another idea I can think of is to add a new operator that means
"test whether the string value is smaller than the `greater string' of
the parameter". This operator could also defer making the `greater
string' until just before searching the btree, summing up histogram
entries, or comparing with column values. If the assumption above
holds, the "make greater string" operation could then be done without
regard to the character encoding. This seems to have a smaller impact
than the "prefix match" operator.

# But, hmm, the more I investigate, the smaller the difference
# seems to be... It is beyond my knowledge for now, anyway. I
# need more study.

On the other hand, if no additional encoding-specific `character
increment functions' turn out to be wanted, the modification of
pg_wchar_table can be cancelled, and make_greater_string can select
the `character increment function' with 'switch (GetDatabaseEncoding())
{ case PG_UTF8: .. }'. That would also get rid of the
pg_generic_charinc tweak for libpq.

At Wed, 21 Sep 2011 21:49:27 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote
> detail work; for instance, I noted an unconstrained memcpy into a 4-byte
> local buffer, as well as lots and lots of violations of PG house style.
> That's certainly all fixable but somebody will have to go through it.

Sorry for the illegal style of the patch. I will fix it.

Regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Sep 22, 2011 at 12:24 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> I'm a bit perplexed as to why we can't find a non-stochastic way of doing this. > > [ collations suck ] Ugh. > Now, having said that, I'm starting to wonder again why it's worth our > trouble to fool with encoding-specific incrementers. The exactness of > the estimates seems unlikely to be improved very much by doing this. Well, so the problem is that the frequency with which the algorithm fails altogether seems to be disturbingly high for certain kinds of characters. I agree it might not be that important to get the absolutely best next string, but it does seem important not to fail outright. Kyotaro Horiguchi gives the example of UTF-8 characters ending with 0xbf. >>>> One random idea I have is - instead of generating > and < clauses, >>>> could we define a "prefix match" operator - i.e. a ### b iff substr(a, >>>> 1, length(b)) = b? We'd need to do something about the selectivity, >>>> but I don't see why that would be a problem. > >>> The problem is that you'd need to make that a btree-indexable operator. > >> Well, right. Without that, there's not much point. But do you think >> that's prohibitively difficult? > > The problem is that you'd just be shifting all these same issues into > the btree index machinery, which is not any better equipped to cope with > them, and would not be a good place to be adding overhead. My thought was that it would avoid the need to do any character incrementing at all. You could just start scanning forward as if the operator were >= and then stop when you hit the first string that doesn't have the same initial substring. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Sep 22, 2011 at 1:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> My thought was that it would avoid the need to do any character
> incrementing at all.  You could just start scanning forward as if the
> operator were >= and then stop when you hit the first string that
> doesn't have the same initial substring.

But the whole problem is that not all the strings with the initial
substring are in a contiguous block. The best we can hope for is that
they're fairly dense within a block without too many non-matching
strings. The example with / shows how that can happen.

If you're looking for foo/% and you start with foo/ you'll find:

foo/
foo0
foo/0
foo1
foo/1
...

Even just case-insensitive collations don't put all the strings with a
common prefix in a contiguous block. If you're searching for foo%
you'll find:

foo
Foobar
foobar

-- 
greg
On Thu, Sep 22, 2011 at 8:59 AM, Greg Stark <stark@mit.edu> wrote:
> On Thu, Sep 22, 2011 at 1:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> My thought was that it would avoid the need to do any character
>> incrementing at all.  You could just start scanning forward as if the
>> operator were >= and then stop when you hit the first string that
>> doesn't have the same initial substring.
>
> But the whole problem is that not all the strings with the initial
> substring are in a contiguous block. The best we can hope for is that
> they're fairly dense within a block without too many non-matching
> strings. The example with / shows how that can happen.
>
> If you're looking for foo/% and you start with foo/ you'll find:
>
> foo/
> foo0
> foo/0
> foo1
> foo/1
> ...
>
> Even just case-insensitive collations don't put all the strings with a
> common prefix in a contiguous block. If you're searching for foo%
> you'll find:
>
> foo
> Foobar
> foobar

If that were true for the sorts of indexes we're using for LIKE
queries, the existing approach wouldn't work either.  All we're doing
is translating:

a LIKE 'foo/%'

to

a ~>=~ 'foo/%' AND a ~<~ 'foo0'

...where ~>=~ and ~<~ are just text-pattern-ops versions of >= and <
that ignore the normal collation rules and just compare bytes.  In
general, if we wanted to get rid of text_pattern_ops and make all of
this work with arbitrary indexes, yeah, that would be very difficult.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Sep 22, 2011 at 8:59 AM, Greg Stark <stark@mit.edu> wrote: >> But the whole problem is that not all the strings with the initial >> substring are in a contiguous block. > If that were true for the sorts of indexes we're using for LIKE > queries, the existing approach wouldn't work either. Right. Since it's not a problem for the sorts of indexes with which we can use LIKE, moving knowledge of LIKE into the btree machinery doesn't buy us a darn thing, except more complexity in a place where we can ill afford it. The essential problem here is "when can you stop scanning, given a pattern with this prefix?", and btree doesn't know any more about that than make_greater_string does; it would in fact have to use make_greater_string or something isomorphic to it. regards, tom lane
On Thu, Sep 22, 2011 at 2:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The essential problem here is "when can you stop scanning,
> given a pattern with this prefix?", and btree doesn't know any more
> about that than make_greater_string does; it would in fact have to use
> make_greater_string or something isomorphic to it.

Hm, as long as btree_pattern_ops is the only opclass that behaves this
way that's more or less true. But Robert's right that if btree just
stops when it finds something that doesn't match, it doesn't need to
hard-code any knowledge of what the "next" value would be. If there
were any other op classes that had this abstract property of always
putting strings with common prefixes in a contiguous block, then it
would continue to work without having to know where to find the
boundaries of that contiguous block.

Just as an example, if you had a pattern_ops opclass that sorted the
string assuming it was in some other encoding like, say, EBCDIC, then
make_greater_string would have to learn about it, but Robert's model
would just work.

This isn't entirely facetious. Sorting by EBCDIC ordering would be
silly, but I vaguely recall there being some examples that wouldn't be
silly. And perhaps some collations could actually be marked as being
acceptable even if they don't sort in pure ascii ordering and
make_greater_string doesn't actually know about them.
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Sep 22, 2011 at 12:24 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Now, having said that, I'm starting to wonder again why it's worth our
>> trouble to fool with encoding-specific incrementers.  The exactness of
>> the estimates seems unlikely to be improved very much by doing this.

> Well, so the problem is that the frequency with which the algorithm
> fails altogether seems to be disturbingly high for certain kinds of
> characters.  I agree it might not be that important to get the
> absolutely best next string, but it does seem important not to fail
> outright.  Kyotaro Horiguchi gives the example of UTF-8 characters
> ending with 0xbf.

[ thinks for a bit ... ]  Yeah, it's certainly true that such a
character might be relatively small in the overall sort order.  The
assumption underlying what we're doing now is that dropping the last
character and incrementing the next-to-last one instead isn't terribly
catastrophic from an estimation accuracy standpoint.  I can see that
there are cases where that would fail to be true, but I'm not exactly
convinced that they're worse than all the other cases where we'll get
a poor estimate.

Anyway, I won't stand in the way of the patch as long as it's modified
to limit the number of values considered for any one character position
to something reasonably small.

			regards, tom lane
Greg Stark <stark@mit.edu> writes: > On Thu, Sep 22, 2011 at 2:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> The essential problem here is "when can you stop scanning, >> given a pattern with this prefix?", and btree doesn't know any more >> about that than make_greater_string does; it would in fact have to use >> make_greater_string or something isomorphic to it. > Hm, as long as btree_pattern_ops is the only opclass that behaves this > way that's more or less true. But Robert's right that if btree just > stops when it finds something that doesn't match it doesn't need to > hard code any knowledge of what the "next" value would be. But you've added mechanism (and hence cycles) to btree searches, and *you haven't actually gained anything*. If the feature is restricted to only work for sort orderings in which common-prefix strings are contiguous, then it doesn't do anything we can't do just as well with the existing mechanism. Moreover, you'll still need make_greater_string because of the problem of trying to extract LIKE selectivity estimates from locale-dependent pg_statistic data. regards, tom lane
On Thu, Sep 22, 2011 at 10:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Anyway, I won't stand in the way of the patch as long as it's modified > to limit the number of values considered for any one character position > to something reasonably small. One thing I was thinking about is that it would be useful to have some metric for judging how well any given algorithm that we might pick here actually works. For example, if we were to try all possible three character strings in some encoding and run make_greater_string() on each one of them, we could then measure the failure percentage. Or if that's too many cases to crank through then we could limit it some way - but the point is, without some kind of test harness here, we have no way of measuring the trade-off between spending more CPU time and improving accuracy. Maybe you have a better feeling for what's reasonable there than I do, but I'm not prepared to take a stab in the dark without benefit of some real measurements. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > One thing I was thinking about is that it would be useful to have some > metric for judging how well any given algorithm that we might pick > here actually works. Well, the metric that we were indirectly using earlier was the number of characters in a given locale for which the algorithm fails to find a greater one (excluding whichever character is "last", I guess, or you could just recognize there's always at least one). > For example, if we were to try all possible > three character strings in some encoding and run make_greater_string() > on each one of them, we could then measure the failure percentage. Or > if that's too many cases to crank through then we could limit it some > way - Even in UTF8 there's only a couple million assigned code points, so for test purposes anyway it doesn't seem like we couldn't crank through them all. Also, in many cases you could probably figure it out by analysis instead of brute-force testing every case. A more reasonable objection might be that a whole lot of those code points are things nobody cares about, and so we need to weight the results somehow by the actual popularity of the character. Not sure how to take that into account. Another issue here is that we need to consider not just whether we find a greater character, but "how much greater" it is. This would apply to my suggestion of incrementing the top byte without considering lower-order bytes --- we'd be skipping quite a lot of code space for each increment, and it's conceivable that that would be quite hurtful in some cases. Not sure how to account for that either. An extreme example here is an "incrementer" that just immediately returns the last character in the sort order for any lesser input. regards, tom lane
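[To make the brute-force measurement idea concrete, here is a minimal
sketch of such a harness in plain C. This is not PostgreSQL code:
utf8_encode and incrementer_under_test are names invented for
illustration, and the trivial placeholder incrementer shown would be
swapped out for whichever function is being measured, for instance the
patch's pg_utf8_increment.]

#include <stdio.h>
#include <stdbool.h>

/* Encode code point cp as UTF-8; returns the byte length (cp < 0x110000 assumed). */
static int
utf8_encode(unsigned int cp, unsigned char *buf)
{
	if (cp < 0x80)
	{
		buf[0] = cp;
		return 1;
	}
	if (cp < 0x800)
	{
		buf[0] = 0xC0 | (cp >> 6);
		buf[1] = 0x80 | (cp & 0x3F);
		return 2;
	}
	if (cp < 0x10000)
	{
		buf[0] = 0xE0 | (cp >> 12);
		buf[1] = 0x80 | ((cp >> 6) & 0x3F);
		buf[2] = 0x80 | (cp & 0x3F);
		return 3;
	}
	buf[0] = 0xF0 | (cp >> 18);
	buf[1] = 0x80 | ((cp >> 12) & 0x3F);
	buf[2] = 0x80 | ((cp >> 6) & 0x3F);
	buf[3] = 0x80 | (cp & 0x3F);
	return 4;
}

/* Placeholder for the incrementer being measured; this trivial version
 * just bumps the last byte when it obviously stays inside the sequence. */
static bool
incrementer_under_test(unsigned char *mb, int len)
{
	if (len == 1)
	{
		if (mb[0] >= 0x7F)
			return false;
		mb[0]++;
		return true;
	}
	if (mb[len - 1] >= 0xBF)
		return false;
	mb[len - 1]++;
	return true;
}

int
main(void)
{
	unsigned long tried = 0, failed = 0;
	unsigned int cp;
	unsigned char buf[4];

	for (cp = 0; cp < 0x110000; cp++)
	{
		if (cp >= 0xD800 && cp <= 0xDFFF)	/* skip UTF-16 surrogate range */
			continue;
		tried++;
		if (!incrementer_under_test(buf, utf8_encode(cp, buf)))
			failed++;
	}
	printf("failed on %lu of %lu code points (%.4f%%)\n",
		   failed, tried, 100.0 * failed / tried);
	return 0;
}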
On Thu, Sep 22, 2011 at 11:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Well, the metric that we were indirectly using earlier was the > number of characters in a given locale for which the algorithm > fails to find a greater one (excluding whichever character is "last", > I guess, or you could just recognize there's always at least one). What about characters that sort differently in sequence than individually? >> For example, if we were to try all possible >> three character strings in some encoding and run make_greater_string() >> on each one of them, we could then measure the failure percentage. Or >> if that's too many cases to crank through then we could limit it some >> way - > > Even in UTF8 there's only a couple million assigned code points, so for > test purposes anyway it doesn't seem like we couldn't crank through them > all. Also, in many cases you could probably figure it out by analysis > instead of brute-force testing every case. > > A more reasonable objection might be that a whole lot of those code > points are things nobody cares about, and so we need to weight the > results somehow by the actual popularity of the character. Not sure > how to take that into account. I guess whether that's a problem in practice will depend somewhat on the quality of the algorithms we're able to find. If our best algorithms still have a 1% failure rate, then yeah, that's an issue, but in that case I'd suggest that our best algorithms suck and we need to think harder about alternate solutions. If we're talking about failing on 5 characters out of a million we can just eyeball them. I'm not trying to reduce this testing to something that is entirely mechanic in every way; I'm just saying that I'm not optimistic about my ability to judge which algorithms will work best in practice without some kind of automated aid. > Another issue here is that we need to consider not just whether we find > a greater character, but "how much greater" it is. This would apply to > my suggestion of incrementing the top byte without considering > lower-order bytes --- we'd be skipping quite a lot of code space for > each increment, and it's conceivable that that would be quite hurtful in > some cases. Not sure how to account for that either. An extreme > example here is an "incrementer" that just immediately returns the last > character in the sort order for any lesser input. Right... well, this is why I'm not wild about doing this by incrementing in the first place. But now that I think about it, what about using some slightly-less-stupid version of that approach as a fallback strategy? For example, we could pick, oh, say, 20 characters out of the space of code points, about evenly distributed under whatever collations we think are likely to be in use. In the incrementer, we try some kind of increment-the-bytes strategy for a while and if it doesn't pan out, we zip through the array and try substituting each of the fallback characters. If more than one works, we test the survivors against each other until we're left with just one winner. The bound might not be real tight, but as long as it's good enough to make the planner pick an index scan it might not matter very much. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
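[A rough sketch of that fallback idea, using only standard C (strcoll)
rather than PostgreSQL's collation machinery. The probe list and
find_fallback are invented for illustration, and the demo assumes the
prefix's last character is a single byte, where real code would use
pg_mbcliplen. Note also that beating the bare prefix under strcoll is
necessary but not sufficient, so a real implementation would still
verify the candidate the way make_greater_string already does.]

#include <locale.h>
#include <stdio.h>
#include <string.h>

/* A hypothetical probe list, as UTF-8 strings; a real one would be ~20
 * characters chosen to be spread out under the collations we expect. */
static const char *probes[] = {
	"~",				/* U+007E */
	"\xC3\xBF",			/* U+00FF */
	"\xCE\xA9",			/* U+03A9 */
	"\xE4\xB8\x80",		/* U+4E00 */
	"\xEF\xBD\x9E"		/* U+FF5E */
};

/*
 * Replace the last (assumed single-byte) character of a non-empty prefix
 * with each probe in turn, returning the first candidate that sorts above
 * the original; NULL means every probe failed, so the caller would have
 * to shorten the prefix and retry.
 */
static const char *
find_fallback(const char *prefix, char *candidate, size_t size)
{
	size_t plen = strlen(prefix);
	size_t i;

	for (i = 0; i < sizeof(probes) / sizeof(probes[0]); i++)
	{
		snprintf(candidate, size, "%.*s%s",
				 (int) (plen - 1), prefix, probes[i]);
		if (strcoll(candidate, prefix) > 0)
			return candidate;
	}
	return NULL;
}

int
main(void)
{
	char candidate[64];

	setlocale(LC_COLLATE, "");	/* whatever collation the environment picks */
	if (find_fallback("foo/", candidate, sizeof(candidate)))
		printf("upper bound candidate: %s\n", candidate);
	else
		printf("no probe worked; shorten the prefix and retry\n");
	return 0;
}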
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Sep 22, 2011 at 11:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Well, the metric that we were indirectly using earlier was the >> number of characters in a given locale for which the algorithm >> fails to find a greater one (excluding whichever character is "last", >> I guess, or you could just recognize there's always at least one). > What about characters that sort differently in sequence than individually? Yeah, there's a whole 'nother set of issues there, but the character incrementer is unlikely to affect that very much either way, I think. > But now that I think about it, what about using some > slightly-less-stupid version of that approach as a fallback strategy? > For example, we could pick, oh, say, 20 characters out of the space of > code points, about evenly distributed under whatever collations we > think are likely to be in use. Sure, if the "increment the top byte" strategy proves to not accomplish that effectively. But I'd prefer not to design a complex strategy until it's been proven that a simpler one doesn't work. regards, tom lane
Re: [v9.2] make_greater_string() does not return a string in some cases
From: Kyotaro HORIGUCHI
Hi, I think I have now roughly comprehended the constructs and the
underlying concept.

At Thu, 22 Sep 2011 12:35:56 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <23159.1316709356@sss.pgh.pa.us>
tgl> Sure, if the "increment the top byte" strategy proves to not accomplish
tgl> that effectively. But I'd prefer not to design a complex strategy until
tgl> it's been proven that a simpler one doesn't work.

OK, I understand that thought, though only indistinctly. But I have
not grasped your measure of complexity.

The current make_greater_string bumps the tail of the bare byte
sequence, checks whether the whole byte sequence is valid in the
database encoding, and tries the next byte value if not. The patch, on
the other hand (although its style is corrupted..), locates the last
CHARACTER, tries to bump that CHARACTER up, and gives up if that
fails. Looking only within the scope of the function
make_greater_string, I feel more complexity in the former because of
its check-and-retry loop. Yes, `bumping a character up' has complexity
within it, but that complexity is encapsulated within a single
function.

From the performance perspective, the current implementation always
slips 64 times (0xff - 0xbf, for UTF-8) and checks the WHOLE pattern
string on every slip, and eventually gives up for the small but not
negligible set of about 100 code points (within Japanese characters
alone). The check-and-retry loop cannot help in those cases. And it
checks the whole pattern string at least once even when it lands
successfully.

The patch's algorithm, by contrast, seeks through the pattern string
once to find the last character, makes no slips over the whole code
space, and declines only at the points where the character length
changes. (Of course it is possible to save those points too, but that
is `too expensive' for the gain, to me :-) Not only checking the whole
string, but also checking the character after the increment operation,
is essentially needless with this method.

To summarize, from my point of view these two methods seem not so
different in performance for the characters that are incrementable
by the current method, and the latter seems rather more efficient and
applicable to most of the `unincrementable' ones.

The patch currently checks the validity of the result as a last
resort, against the possible inconsistency caused by a careless change
of the validity check function (a change of the character set, in
other words - very unlikely). But even that is unnecessary if the
consistency between the incrementer and the validator has been checked
after the last modification. The four-byte memcpy can be removed by
changing the rewinding method. These modifications make the operation
cheap (I think..) for UTF-8 and leave the behavior for the other
encodings untouched.

As I wrote in the previous message, the modification of pg_wchar_table
can be wound back until another incrementer comes along.

Can I have another chance to show a new version of the patch along
these lines?

Regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Sep 22, 2011 at 10:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Anyway, I won't stand in the way of the patch as long as it's modified > to limit the number of values considered for any one character position > to something reasonably small. I think that limit in both the old and new code is 1, except that the new code does it more efficiently. Am I confused? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Sep 23, 2011 at 5:16 AM, Kyotaro HORIGUCHI <horiguchi.kyotaro@oss.ntt.co.jp> wrote: > Can I have another chance to show the another version of the > patch according to the above? You can always post a new version of any patch. I think what you need to focus on is cleaning up the coding style to match PG project standards. In particular, you need to fix your brace positioning and the way function declarations are laid out. Someone will have to do that before this is committed, and ideally that person should be you rather than, say, me. :-) It would be good to fix the memcpy() issue Tom found too, and make a careful examination of the code for anything else that can be tidied up or made more bulletproof. It seems like we are getting close to agreement on the basic behavior, so that is good. We can always fine-tune that later; that shouldn't be much work. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Sep 22, 2011 at 10:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Anyway, I won't stand in the way of the patch as long as it's modified >> to limit the number of values considered for any one character position >> to something reasonably small. > I think that limit in both the old and new code is 1, except that the > new code does it more efficiently. > Am I confused? Yes, or else I am. Consider a 4-byte UTF8 character at the end of the string. The existing code increments the last byte up to 255 (rejecting everything past 0xBF), then gives up and truncates that character away. So the maximum number of tries for that character position is between 0 and 127 depending on what the original character was (with at most 63 of the incremented values getting past the verifymbstr test). The proposed patch is going to iterate through all Unicode code points up to U+7FFFFF before giving up. Since it's possible that we need to increment something further left to succeed at all, this doesn't seem like a good plan. regards, tom lane
On Fri, Sep 23, 2011 at 8:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Sep 22, 2011 at 10:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Anyway, I won't stand in the way of the patch as long as it's modified
>>> to limit the number of values considered for any one character position
>>> to something reasonably small.
>
>> I think that limit in both the old and new code is 1, except that the
>> new code does it more efficiently.
>
>> Am I confused?
>
> Yes, or else I am.  Consider a 4-byte UTF8 character at the end of the
> string.  The existing code increments the last byte up to 255 (rejecting
> everything past 0xBF), then gives up and truncates that character away.
> So the maximum number of tries for that character position is between 0
> and 127 depending on what the original character was (with at most 63 of
> the incremented values getting past the verifymbstr test).
>
> The proposed patch is going to iterate through all Unicode code points
> up to U+7FFFFF before giving up.  Since it's possible that we need to
> increment something further left to succeed at all, this doesn't seem
> like a good plan.

I think you're misreading the code.  It does this:

while (len > 0)
{
    boring stuff;
    if (charincfunc(lastchar, charlen))
    {
        more boring stuff;
        if (we made a greater string)
            return it;
        cleanup;
    }
    truncate away last character;
}

I don't see how that's ever going to try more than one character in
the same position.

What may be confusing you is that the old code has two loops: an outer
loop that tests whether we've made a greater string, and an inner loop
that tests whether we've made a validly encoded string at all.  In the
new code, at least in the UTF-8 case, the inner loop is GONE
altogether.  Instead of iterating until we construct a valid
character, we just use our mad UTF-8 skillz to assemble one, and
return it.

Or else I need to go drink a few cups of tea and look at this again.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
One idea: col like 'foo%' could be translated to col >= 'foo' and
col <= 'foo' || 'zzz', where 'z' is the largest possible character.
This should be good enough for calculating stats.

How to find such a character, I do not know.
On fre, 2011-09-23 at 20:35 +0300, Marcin Mańk wrote:
> One idea: col like 'foo%' could be translated to col >= 'foo' and
> col <= 'foo' || 'zzz', where 'z' is the largest possible character.
> This should be good enough for calculating stats.
>
> How to find such a character, I do not know.

That's what makes this so difficult.

If we knew the largest character, we could probably also find the
largest-1, largest-2, etc. characters and determine the total order of
everything.
Peter Eisentraut <peter_e@gmx.net> writes:
> On fre, 2011-09-23 at 20:35 +0300, Marcin Mańk wrote:
>> One idea: col like 'foo%' could be translated to col >= 'foo' and
>> col <= 'foo' || 'zzz', where 'z' is the largest possible character.
>> This should be good enough for calculating stats.

>> How to find such a character, I do not know.

> That's what makes this so difficult.

> If we knew the largest character, we could probably also find the
> largest-1, largest-2, etc. characters and determine the total order of
> everything.

No, it's a hundred times worse than that, because in collations other
than C there typically *is* no total order.  The collation behavior of
many characters is context-sensitive, thanks to the multi-pass behavior
of typical "dictionary" algorithms.

			regards, tom lane
On mån, 2011-09-26 at 10:08 -0400, Tom Lane wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
>> On fre, 2011-09-23 at 20:35 +0300, Marcin Mańk wrote:
>>> One idea: col like 'foo%' could be translated to col >= 'foo' and
>>> col <= 'foo' || 'zzz', where 'z' is the largest possible character.
>>> This should be good enough for calculating stats.
>>> How to find such a character, I do not know.
>
>> That's what makes this so difficult.
>
>> If we knew the largest character, we could probably also find the
>> largest-1, largest-2, etc. characters and determine the total order of
>> everything.
>
> No, it's a hundred times worse than that, because in collations other
> than C there typically *is* no total order.  The collation behavior of
> many characters is context-sensitive, thanks to the multi-pass behavior
> of typical "dictionary" algorithms.

Well, there is a total order of all strings, but it's not consistent
under string concatenation.

But there is a "largest character".  If the collation implementation
uses four weights (the typical case), the largest character is the one
that maps to <FFFF> <FFFF> <FFFF> <FFFF>.  If you appended that
character to a string, you would get a larger string.  (Unless there
are French backwards levels or other funny things in place, perhaps.)
But we don't know which character that is, and likely there isn't one,
so we'd need the largest character that maps to an actually assigned
weight, and that's not possible without an exhaustive search of all
collating elements.

We could possibly try to make this whole thing work differently by
storing the strxfrm results in the histograms.
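[For reference, the strxfrm() contract that this idea would rely on, in
standard C; the locale name below is an assumption, any installed
collation works. strcmp() on the transformed keys orders the same way
strcoll() orders the original strings, which is what would let
transformed histogram entries be compared bytewise.]

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Transform a string into its collation sort key; caller frees. */
static char *
make_sort_key(const char *s)
{
	size_t n = strxfrm(NULL, s, 0);		/* ask for the required length */
	char *key = malloc(n + 1);

	strxfrm(key, s, n + 1);
	return key;
}

int
main(void)
{
	setlocale(LC_COLLATE, "en_US.UTF-8");	/* assumed to be installed */

	char *k1 = make_sort_key("/root/");
	char *k2 = make_sort_key("/root0");

	/* strcmp on the sort keys == strcoll on the original strings */
	printf("strcoll: %d, key strcmp: %d\n",
		   strcoll("/root/", "/root0"), strcmp(k1, k2));

	free(k1);
	free(k2);
	return 0;
}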
Peter Eisentraut <peter_e@gmx.net> writes: > On mån, 2011-09-26 at 10:08 -0400, Tom Lane wrote: >> No, it's a hundred times worse than that, because in collations other >> than C there typically *is* no total order. The collation behavior of >> many characters is context-sensitive, thanks to the multi-pass behavior >> of typical "dictionary" algorithms. > Well, there is a total order of all strings, but it's not consistent > under string concatenation. > But there is a "largest character". If the collation implementation > uses four weights (the typical case), the largest character is the one > that maps to <FFFF> <FFFF> <FFFF> <FFFF>. If you appended that > character to a string, you would get a larger string. (Unless there are > French backwards levels or other funny things in place, perhaps.) But the problem is not "make a string greater than this given one". It is "make a string greater than any string that begins with this given one". As an example, suppose we are given "xyz" where "z" is the last letter in the collation. We can probably find characters such as "~" that are greater than "z", but no string x-y-nonletter is going to be considered greater than x-y-z-z by a dictionary sorting method. This is why make_greater_string has to be prepared to give up and go increment some character before the last one: the only way to succeed for this input is to construct "xz". regards, tom lane
Re: [v9.2] make_greater_string() does not return a string in some cases
From: Kyotaro HORIGUCHI
This is a new version of the make_greater_string patch.

1. wchar.c:1532 pg_wchar_table:
   Restore the pg_wchar_table.
2. wchar.c:1371 pg_utf8_increment:
   Remove the dangerous memcpy; one memcpy is left because it is safe.
   Remove the code check after increment.
3. wchar.c:1429 pg_eucjp_increment:
   Remove the dangerous memcpy. Remove the code check after increment.
   Minor bug fix.
4. wchar.c:1654 pg_database_encoding_character_incrementer:
   Select the increment function by switch.

horiguchi.kyotaro> This is a rebased patch of `Allow encoding specific character
horiguchi.kyotaro> incrementer' (https://commitfest.postgresql.org/action/patch_view?id=602).

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 3e84679..593dba6 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -5653,6 +5653,18 @@ pattern_selectivity(Const *patt, Pattern_Type ptype)
 
 /*
+ * This function is "character increment" function for bytea used in
+ * make_greater_string() that has same interface with pg_wchar_tbl.charinc.
+ */
+static bool byte_increment(unsigned char *ptr, int len)
+{
+	if (*ptr >= 255) return false;
+
+	(*ptr)++;
+	return true;
+}
+
+/*
  * Try to generate a string greater than the given string or any
  * string it is a prefix of.  If successful, return a palloc'd string
  * in the form of a Const node; else return NULL.
@@ -5691,6 +5703,7 @@ make_greater_string(const Const *str_const, FmgrInfo *ltproc, Oid collation)
 	int			len;
 	Datum		cmpstr;
 	text	   *cmptxt = NULL;
+	character_incrementer charincfunc;
 
 	/*
 	 * Get a modifiable copy of the prefix string in C-string format, and set
@@ -5752,27 +5765,38 @@ make_greater_string(const Const *str_const, FmgrInfo *ltproc, Oid collation)
 		}
 	}
 
+	if (datatype != BYTEAOID)
+		charincfunc = pg_database_encoding_character_incrementer();
+	else
+		charincfunc = &byte_increment;
+
 	while (len > 0)
 	{
-		unsigned char *lastchar = (unsigned char *) (workstr + len - 1);
-		unsigned char savelastchar = *lastchar;
+		int			charlen;
+		unsigned char *lastchar;
+		unsigned char savelastbyte;
+		Const	   *workstr_const;
+
+		if (datatype == BYTEAOID)
+			charlen = 1;
+		else
+			charlen = len - pg_mbcliplen(workstr, len, len - 1);
+
+		lastchar = (unsigned char *) (workstr + len - charlen);
 
 		/*
-		 * Try to generate a larger string by incrementing the last byte.
+		 * savelastbyte has meaning only for datatype == BYTEAOID
 		 */
-		while (*lastchar < (unsigned char) 255)
-		{
-			Const	   *workstr_const;
+		savelastbyte = *lastchar;
 
-			(*lastchar)++;
+		/*
+		 * Try to generate a larger string by incrementing the last byte or
+		 * character.
+		 */
+		if (charincfunc(lastchar, charlen)) {
 			if (datatype != BYTEAOID)
-			{
-				/* do not generate invalid encoding sequences */
-				if (!pg_verifymbstr(workstr, len, true))
-					continue;
 				workstr_const = string_to_const(workstr, datatype);
-			}
 			else
 				workstr_const = string_to_bytea_const(workstr, len);
 
@@ -5787,26 +5811,17 @@ make_greater_string(const Const *str_const, FmgrInfo *ltproc, Oid collation)
 				pfree(workstr);
 				return workstr_const;
 			}
-
+
 			/* No good, release unusable value and try again */
 			pfree(DatumGetPointer(workstr_const->constvalue));
 			pfree(workstr_const);
 		}
 
-		/* restore last byte so we don't confuse pg_mbcliplen */
-		*lastchar = savelastchar;
-
 		/*
-		 * Truncate off the last character, which might be more than 1 byte,
-		 * depending on the character encoding.
+		 * Truncate off the last character or restore last byte for BYTEA.
 		 */
-		if (datatype != BYTEAOID && pg_database_encoding_max_length() > 1)
-			len = pg_mbcliplen(workstr, len, len - 1);
-		else
-			len -= 1;
-
-		if (datatype != BYTEAOID)
-			workstr[len] = '\0';
+		len -= charlen;
+		workstr[len] = (datatype != BYTEAOID ? '\0' : savelastbyte);
 	}
 
 	/* Failed... */
diff --git a/src/backend/utils/mb/wchar.c b/src/backend/utils/mb/wchar.c
index f23732f..d6dc717 100644
--- a/src/backend/utils/mb/wchar.c
+++ b/src/backend/utils/mb/wchar.c
@@ -1336,6 +1336,195 @@ pg_utf8_islegal(const unsigned char *source, int length)
 
 /*
  *-------------------------------------------------------------------
+ * character incrementer
+ *
+ * These functions accept "charptr", a pointer to the first byte of a
+ * maybe-multibyte character. Try `increment' the character and return true if
+ * successed. If these functions returns false, the character should be
+ * untouched. These functions must be implemented in correspondence with
+ * verifiers, in other words, the rewrited character by this function must pass
+ * the check by pg_*_verifier() if returns true. Returning the return value of
+ * pg_*_verifier() corresponding can finnaly avoid such a inconsistency when
+ * something wrong.
+ * -------------------------------------------------------------------
+ */
+
+#ifndef FRONTEND
+static bool pg_generic_charinc(unsigned char *charptr, int len)
+{
+	unsigned char *lastchar = (unsigned char *) (charptr + len - 1);
+	unsigned char savelastchar = *lastchar;
+	const char *const_charptr = (const char *) charptr;
+
+	while (*lastchar < (unsigned char) 255)
+	{
+		(*lastchar)++;
+		if (!pg_verifymbstr(const_charptr, len, true))
+			continue;
+		return true;
+	}
+
+	*lastchar = savelastchar;
+	return false;
+}
+
+static bool
+pg_utf8_increment(unsigned char *charptr, int length)
+{
+	unsigned char a;
+	unsigned char bak[4];
+	bool		success;
+
+	switch (length)
+	{
+		default:
+			/* reject lengths 5 and 6 for now */
+			return false;
+		case 4:
+			bak[3] = charptr[3];
+			a = charptr[3];
+			if (a < 0xBF)
+			{
+				charptr[3]++;
+				break;
+			}
+			charptr[3] = 0x80;
+			/* FALL THRU */
+		case 3:
+			bak[2] = charptr[2];
+			a = charptr[2];
+			if (a < 0xBF)
+			{
+				charptr[2]++;
+				break;
+			}
+			charptr[2] = 0x80;
+			/* FALL THRU */
+		case 2:
+			bak[1] = charptr[1];
+			a = charptr[1];
+			if ((*charptr == 0xed && a < 0x9F) || a < 0xBF)
+			{
+				charptr[1]++;
+				break;
+			}
+			charptr[1] = 0x80;
+			/* FALL THRU */
+		case 1:
+			bak[0] = *charptr;
+			a = *charptr;
+			if (a == 0x7F || a == 0xDF || a == 0xEF || a == 0xF7) {
+				/* Rewinding modified bytes and return fail. length is
+				 * confirmed to be between 1 and 4 here.
*/ + memcpy(charptr, bak, length); + return false; + } + charptr[0]++; + break; + } + + return true; +} + +static bool +pg_eucjp_increment(unsigned char *charptr, int length) +{ + unsigned char bak[3]; + bool success; + unsigned char c1, c2; + signed int i; + + c1 = *charptr; + + switch (c1) + { + case SS2: /* JIS X 0201 */ + if (length != 2) return false; + + c2 = charptr[1]; + + if (c2 > 0xde) + charptr[0] = charptr[1] = 0xa1; + else if (c2 < 0xa1) + charptr[1] = 0xa1; + else + charptr[1]++; + + break; + + case SS3: /* JIS X 0212 */ + if (length != 3) return false; + + for (i = 2 ; i > 0 ; i--) + { + bak[i] = charptr[i]; + c2 = charptr[i]; + if (c2 < 0xa1) + { + charptr[i] = 0xa1; + return true; + } + else if (c2 < 0xfe) + { + charptr[i]++; + break; + } + charptr[i] = 0xa1; + } + + + if (i == 0) /* Out of 3-byte code region */ + { + charptr[1] = bak[1]; + charptr[2] = bak[2]; + return false; + } + + break; + + default: + if (IS_HIGHBIT_SET(c1)) /* JIS X 0208? */ + { + if (length != 2) return false; + + for (i = 1 ; i >= 0 ; i--) /* i must be signed */ + { + bak[i] = charptr[i]; + c2 = charptr[i]; + if (c2 < 0xa1) + { + charptr[i] = 0xa1; + return true; + } + else if (c2 < 0xfe) + { + charptr[i]++; + break; + } + charptr[i] = 0xa1; + } + + if (i < 0) /* Out of 2 byte code region */ + { + charptr[0] = bak[0]; + charptr[1] = bak[1]; + return false; + } + } + else + { /* ASCII, single byte */ + if (c1 > 0x7e) + return false; + (*charptr)++; + } + } + + return true; +} +#endif + +/* + *------------------------------------------------------------------- * encoding info table * XXX must be sorted by thesame order as enum pg_enc (in mb/pg_wchar.h) *------------------------------------------------------------------- @@ -1459,6 +1648,24 @@ pg_database_encoding_max_length(void)}/* + * give the character incrementer for the encoding for the current database + */ +character_incrementer +pg_database_encoding_character_incrementer(void) +{ + switch (GetDatabaseEncoding()) { + case PG_UTF8: + return pg_utf8_increment; + + case PG_EUC_JP: + return pg_eucjp_increment; + + default: + return pg_generic_charinc; + } +} + +/* * Verify mbstr to make sure that it is validly encoded in the current * database encoding. Otherwise same as pg_verify_mbstr().*/ diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h index 826c7af..728175c 100644 --- a/src/include/mb/pg_wchar.h +++ b/src/include/mb/pg_wchar.h @@ -284,6 +284,8 @@ typedef int (*mblen_converter) (const unsigned char *mbstr);typedef int (*mbdisplaylen_converter) (constunsigned char *mbstr); +typedef bool (*character_incrementer) (unsigned char *mbstr, int len); +typedef int (*mbverifier) (const unsigned char *mbstr, int len);typedef struct @@ -389,6 +391,7 @@ extern int pg_encoding_mbcliplen(int encoding, const char *mbstr,extern int pg_mbcharcliplen(constchar *mbstr, int len, int imit);extern int pg_encoding_max_length(int encoding);extern int pg_database_encoding_max_length(void); +extern character_incrementer pg_database_encoding_character_incrementer(void);extern int PrepareClientEncoding(int encoding);externint SetClientEncoding(int encoding);
On Thu, Sep 29, 2011 at 6:24 AM, Kyotaro HORIGUCHI <horiguchi.kyotaro@oss.ntt.co.jp> wrote:
> This is new version of make_greater_string patch.

According to the comments in the original source code, the purpose of savelastchar is to avoid confusing pg_mbcliplen(). You've preserved savelastchar only for the case where datatype == BYTEAOID, while making it the increment function's job not to do anything permanent unless it also returns true. But it seems to me that if the datatype is BYTEAOID then there's no need to restore anything at all, because we're not going to call pg_mbcliplen() in that case anyway. So I think the logic here can be simplified.

Also, you haven't completely fixed the style issues. Function definitions should look like this:

static void
thingy()
{
}

Not like this:

static void thingy()
{
}

Opening curly braces should be on a line by themselves, not at the end of the preceding if, while, etc. line.

"finnaly" is spelled incorrectly.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Oct 3, 2011 at 2:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Sep 29, 2011 at 6:24 AM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@oss.ntt.co.jp> wrote:
>> This is new version of make_greater_string patch.
>
> According to the comments in the original source code, the purpose of
> savelastchar is to avoid confusing pg_mbcliplen(). You've preserved
> savelastchar only for the case where datatype == BYTEAOID, while
> making it the increment function's job not to do anything permanent
> unless it also returns true. But it seems to me that if the datatype
> is BYTEAOID then there's no need to restore anything at all, because
> we're not going to call pg_mbcliplen() in that case anyway. So I
> think the logic here can be simplified.
>
> Also, you haven't completely fixed the style issues. Function
> definitions should look like this:
>
> static void
> thingy()
> {
> }
>
> Not like this:
>
> static void thingy()
> {
> }
>
> Opening curly braces should be on a line by themselves, not at the end
> of the preceding if, while, etc. line.
>
> "finnaly" is spelled incorrectly.

Oh, and there's this:

wchar.c: In function ‘pg_utf8_increment’:
wchar.c:1376: warning: unused variable ‘success’
wchar.c: In function ‘pg_eucjp_increment’:
wchar.c:1433: warning: unused variable ‘success’

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [v9.2] make_greater_string() does not return a string in some cases
From: Kyotaro HORIGUCHI
Thank you for reviewing. The new version of this patch is attached to this message. > > But it seems to me that if the datatype is BYTEAOID then > > there's no need to restore anything at all, because we're not > > going to call pg_mbcliplen() in that case anyway. So I think > > the logic here can be simplified. I agree with you. The original code does not restore the changed byte. I removed the lastchar preservation from make_greater_string(). > > Also, you haven't completely fixed the style issues. Function > > definitions should look like this: .. > > Opening curly braces should be on a line by themselves, not at the end > > of the preceding if, while, etc. line. > > > > "finnaly" is spelled incorrectly. I'm very sorry for left mixed defferent style a lot. I think I've done the correction of the styles for function definition and braces. The misspelled word is removed with whole sentense because re-cheking just before return had been removed. > Oh, and there's this: > > wchar.c: In function ‘pg_utf8_increment’: > wchar.c:1376: warning: unused variable ‘success’ > wchar.c: In function ‘pg_eucjp_increment’: > wchar.c:1433: warning: unused variable ‘success’ Oops... I have rebased the patch and removed all warnings. make check has been passed all-ok. I confirmed that the planner decides to use index with proper boundaries for like expression with the certain characters on problematic code point, on the database encodings both UTF-8 and EUC-JP with the database locale is C, and database locale is ja_JP.UTF8. And also for bytea ends with 0xff and 0x00, 0x01. This is the third version of the patch. Regards, -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c index 8ceea82..59f8c37 100644 --- a/src/backend/utils/adt/selfuncs.c +++ b/src/backend/utils/adt/selfuncs.c @@ -5664,6 +5664,19 @@ pattern_selectivity(Const *patt, Pattern_Type ptype)/* + * This function is "character increment" function for bytea used in + * make_greater_string() that has same interface with pg_wchar_tbl.charinc. + */ +static bool +byte_increment(unsigned char *ptr, int len) +{ + if (*ptr >= 255) return false; + + (*ptr)++; + return true; +} + +/* * Try to generate a string greater than the given string or any * string it is a prefix of. If successful, return apalloc'd string * in the form of a Const node; else return NULL. @@ -5702,6 +5715,7 @@ make_greater_string(const Const *str_const, FmgrInfo *ltproc, Oid collation) int len; Datum cmpstr; text *cmptxt = NULL; + character_incrementer charincfunc; /* * Get a modifiable copy of the prefix string in C-string format, and set @@ -5763,29 +5777,33 @@ make_greater_string(const Const *str_const, FmgrInfo *ltproc, Oid collation) } } + if (datatype == BYTEAOID) + charincfunc = &byte_increment; + else + charincfunc = pg_database_encoding_character_incrementer(); + while (len > 0) { - unsigned char *lastchar = (unsigned char *) (workstr + len - 1); - unsigned char savelastchar = *lastchar; + int charlen = 1; + unsigned char *lastchar; + Const *workstr_const; + + if (datatype != BYTEAOID) + charlen = len - pg_mbcliplen(workstr, len, len - 1); + + lastchar = (unsigned char *) (workstr + len - charlen); /* - * Try to generate a larger string by incrementing the last byte. + * Try to generate a larger string by incrementing the last byte or + * character. 
*/ - while (*lastchar < (unsigned char) 255) - { - Const *workstr_const; - - (*lastchar)++; - if (datatype != BYTEAOID) - { - /* do not generate invalid encoding sequences */ - if (!pg_verifymbstr(workstr, len, true)) - continue; - workstr_const = string_to_const(workstr, datatype); - } - else + if (charincfunc(lastchar, charlen)) + { + if (datatype == BYTEAOID) workstr_const = string_to_bytea_const(workstr, len); + else + workstr_const = string_to_const(workstr, datatype); if (DatumGetBool(FunctionCall2Coll(ltproc, collation, @@ -5804,20 +5822,10 @@ make_greater_string(const Const *str_const, FmgrInfo *ltproc, Oid collation) pfree(workstr_const); } - /* restore last byte so we don't confuse pg_mbcliplen */ - *lastchar = savelastchar; - /* - * Truncate off the last character, which might be more than 1 byte, - * depending on the character encoding. + * Truncate off the last character or byte. */ - if (datatype != BYTEAOID && pg_database_encoding_max_length() > 1) - len = pg_mbcliplen(workstr, len, len - 1); - else - len -= 1; - - if (datatype != BYTEAOID) - workstr[len] = '\0'; + len -= charlen; } /* Failed... */ diff --git a/src/backend/utils/mb/wchar.c b/src/backend/utils/mb/wchar.c index f23732f..e8a1bc8 100644 --- a/src/backend/utils/mb/wchar.c +++ b/src/backend/utils/mb/wchar.c @@ -1336,6 +1336,194 @@ pg_utf8_islegal(const unsigned char *source, int length)/* *------------------------------------------------------------------- + * character incrementer + * + * These functions accept "charptr", a pointer to the first byte of a + * maybe-multibyte character. Try `increment' the character and return + * true if successed. If these functions returns false, the character + * should be untouched. These functions must be implemented in + * correspondence with verifiers, in other words, the rewrited + * character by this function must pass the check by pg_*_verifier() + * if returns true. + * ------------------------------------------------------------------- + */ + +#ifndef FRONTEND +static bool +pg_generic_charinc(unsigned char *charptr, int len) +{ + unsigned char *lastchar = (unsigned char *) (charptr + len - 1); + unsigned char savelastchar = *lastchar; + const char *const_charptr = (const char *)charptr; + + while (*lastchar < (unsigned char) 255) + { + (*lastchar)++; + if (!pg_verifymbstr(const_charptr, len, true)) + continue; + return true; + } + + *lastchar = savelastchar; + return false; +} + +static bool +pg_utf8_increment(unsigned char *charptr, int length) +{ + unsigned char a; + unsigned char bak[4]; + + switch (length) + { + default: + /* reject lengths 5 and 6 for now */ + return false; + case 4: + bak[3] = charptr[3]; + a = charptr[3]; + if (a < 0xBF) + { + charptr[3]++; + break; + } + charptr[3] = 0x80; + /* FALL THRU */ + case 3: + bak[2] = charptr[2]; + a = charptr[2]; + if (a < 0xBF) + { + charptr[2]++; + break; + } + charptr[2] = 0x80; + /* FALL THRU */ + case 2: + bak[1] = charptr[1]; + a = charptr[1]; + if ((*charptr == 0xed && a < 0x9F) || a < 0xBF) + { + charptr[1]++; + break; + } + charptr[1] = 0x80; + /* FALL THRU */ + case 1: + bak[0] = *charptr; + a = *charptr; + if (a == 0x7F || a == 0xDF || a == 0xEF || a == 0xF7) + { + /* Rewinding modified bytes and return fail. length is + * confirmed to be between 1 and 4 here. 
*/ + memcpy(charptr, bak, length); + return false; + } + charptr[0]++; + break; + } + + return true; +} + +static bool +pg_eucjp_increment(unsigned char *charptr, int length) +{ + unsigned char bak[3]; + unsigned char c1, c2; + signed int i; + + c1 = *charptr; + + switch (c1) + { + case SS2: /* JIS X 0201 */ + if (length != 2) return false; + + c2 = charptr[1]; + + if (c2 > 0xde) + charptr[0] = charptr[1] = 0xa1; + else if (c2 < 0xa1) + charptr[1] = 0xa1; + else + charptr[1]++; + + break; + + case SS3: /* JIS X 0212 */ + if (length != 3) return false; + + for (i = 2 ; i > 0 ; i--) + { + bak[i] = charptr[i]; + c2 = charptr[i]; + if (c2 < 0xa1) + { + charptr[i] = 0xa1; + return true; + } + else if (c2 < 0xfe) + { + charptr[i]++; + break; + } + charptr[i] = 0xa1; + } + + + if (i == 0) /* Out of 3-byte code region */ + { + charptr[1] = bak[1]; + charptr[2] = bak[2]; + return false; + } + + break; + + default: + if (IS_HIGHBIT_SET(c1)) /* JIS X 0208? */ + { + if (length != 2) return false; + + for (i = 1 ; i >= 0 ; i--) /* i must be signed */ + { + bak[i] = charptr[i]; + c2 = charptr[i]; + if (c2 < 0xa1) + { + charptr[i] = 0xa1; + return true; + } + else if (c2 < 0xfe) + { + charptr[i]++; + break; + } + charptr[i] = 0xa1; + } + + if (i < 0) /* Out of 2 byte code region */ + { + charptr[0] = bak[0]; + charptr[1] = bak[1]; + return false; + } + } + else + { /* ASCII, single byte */ + if (c1 > 0x7e) + return false; + (*charptr)++; + } + } + + return true; +} +#endif + +/* + *------------------------------------------------------------------- * encoding info table * XXX must be sorted by thesame order as enum pg_enc (in mb/pg_wchar.h) *------------------------------------------------------------------- @@ -1459,6 +1647,25 @@ pg_database_encoding_max_length(void)}/* + * give the character incrementer for the encoding for the current database + */ +character_incrementer +pg_database_encoding_character_incrementer(void) +{ + switch (GetDatabaseEncoding()) + { + case PG_UTF8: + return pg_utf8_increment; + + case PG_EUC_JP: + return pg_eucjp_increment; + + default: + return pg_generic_charinc; + } +} + +/* * Verify mbstr to make sure that it is validly encoded in the current * database encoding. Otherwise same as pg_verify_mbstr().*/ diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h index 826c7af..728175c 100644 --- a/src/include/mb/pg_wchar.h +++ b/src/include/mb/pg_wchar.h @@ -284,6 +284,8 @@ typedef int (*mblen_converter) (const unsigned char *mbstr);typedef int (*mbdisplaylen_converter) (constunsigned char *mbstr); +typedef bool (*character_incrementer) (unsigned char *mbstr, int len); +typedef int (*mbverifier) (const unsigned char *mbstr, int len);typedef struct @@ -389,6 +391,7 @@ extern int pg_encoding_mbcliplen(int encoding, const char *mbstr,extern int pg_mbcharcliplen(constchar *mbstr, int len, int imit);extern int pg_encoding_max_length(int encoding);extern int pg_database_encoding_max_length(void); +extern character_incrementer pg_database_encoding_character_incrementer(void);extern int PrepareClientEncoding(int encoding);externint SetClientEncoding(int encoding);
On Fri, Oct 7, 2011 at 12:22 AM, Kyotaro HORIGUCHI <horiguchi.kyotaro@oss.ntt.co.jp> wrote: > Thank you for reviewing. > > The new version of this patch is attached to this message. OK, I think this is reasonably close to being committable now. There are a few remaining style and grammar mistakes but I can fix those up before committing. One thing I still think it would be useful to add, though, is some comments to pg_utf8_increment() and pg_eucjp_increment() describing the algorithm being used. Can you take a crack at that? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [v9.2] make_greater_string() does not return a string in some cases
From: Kyotaro HORIGUCHI
At Fri, 7 Oct 2011 12:25:08 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+Tgmoao2OoZBMUsfp3ZC0_LgXsv3jBvY9eYR5h+cZYez7j26Q@mail.gmail.com> > OK, I think this is reasonably close to being committable now. There > are a few remaining style and grammar mistakes but I can fix those up > before committing. Thank you. > One thing I still think it would be useful to add, > though, is some comments to pg_utf8_increment() and > pg_eucjp_increment() describing the algorithm being used. Can you > take a crack at that? Yes I'll do it in a day or two. Regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [v9.2] make_greater_string() does not return a string in some cases
From: Kyotaro HORIGUCHI
Hello, the work is finished. Version 4 of the patch is attached to this message. - Add rough description of the algorithm as comment to pg_utf8_increment() and pg_eucjp_increment(). - Fixed a bug of pg_utf8_increment() found while working. pg_(utf8|eucjp)_increment are retested on whole valid code pointsto be properly handled. - The comment previously pointed out as being wrong in grammar is left untouched. I'm sorry to bother you with my poor English. At Tue, 11 Oct 2011 16:55:00 +0900 (JST), Kyotaro HORIGUCHI <horiguchi.kyotaro@oss.ntt.co.jp> wrote > > One thing I still think it would be useful to add, > > though, is some comments to pg_utf8_increment() and > > pg_eucjp_increment() describing the algorithm being used. Can you > > take a crack at that? > > Yes I'll do it in a day or two. Regards, -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c index 8ceea82..59f8c37 100644 --- a/src/backend/utils/adt/selfuncs.c +++ b/src/backend/utils/adt/selfuncs.c @@ -5664,6 +5664,19 @@ pattern_selectivity(Const *patt, Pattern_Type ptype)/* + * This function is "character increment" function for bytea used in + * make_greater_string() that has same interface with pg_wchar_tbl.charinc. + */ +static bool +byte_increment(unsigned char *ptr, int len) +{ + if (*ptr >= 255) return false; + + (*ptr)++; + return true; +} + +/* * Try to generate a string greater than the given string or any * string it is a prefix of. If successful, return apalloc'd string * in the form of a Const node; else return NULL. @@ -5702,6 +5715,7 @@ make_greater_string(const Const *str_const, FmgrInfo *ltproc, Oid collation) int len; Datum cmpstr; text *cmptxt = NULL; + character_incrementer charincfunc; /* * Get a modifiable copy of the prefix string in C-string format, and set @@ -5763,29 +5777,33 @@ make_greater_string(const Const *str_const, FmgrInfo *ltproc, Oid collation) } } + if (datatype == BYTEAOID) + charincfunc = &byte_increment; + else + charincfunc = pg_database_encoding_character_incrementer(); + while (len > 0) { - unsigned char *lastchar = (unsigned char *) (workstr + len - 1); - unsigned char savelastchar = *lastchar; + int charlen = 1; + unsigned char *lastchar; + Const *workstr_const; + + if (datatype != BYTEAOID) + charlen = len - pg_mbcliplen(workstr, len, len - 1); + + lastchar = (unsigned char *) (workstr + len - charlen); /* - * Try to generate a larger string by incrementing the last byte. + * Try to generate a larger string by incrementing the last byte or + * character. */ - while (*lastchar < (unsigned char) 255) - { - Const *workstr_const; - - (*lastchar)++; - if (datatype != BYTEAOID) - { - /* do not generate invalid encoding sequences */ - if (!pg_verifymbstr(workstr, len, true)) - continue; - workstr_const = string_to_const(workstr, datatype); - } - else + if (charincfunc(lastchar, charlen)) + { + if (datatype == BYTEAOID) workstr_const = string_to_bytea_const(workstr, len); + else + workstr_const = string_to_const(workstr, datatype); if (DatumGetBool(FunctionCall2Coll(ltproc, collation, @@ -5804,20 +5822,10 @@ make_greater_string(const Const *str_const, FmgrInfo *ltproc, Oid collation) pfree(workstr_const); } - /* restore last byte so we don't confuse pg_mbcliplen */ - *lastchar = savelastchar; - /* - * Truncate off the last character, which might be more than 1 byte, - * depending on the character encoding. + * Truncate off the last character or byte. 
*/ - if (datatype != BYTEAOID && pg_database_encoding_max_length() > 1) - len = pg_mbcliplen(workstr, len, len - 1); - else - len -= 1; - - if (datatype != BYTEAOID) - workstr[len] = '\0'; + len -= charlen; } /* Failed... */ diff --git a/src/backend/utils/mb/wchar.c b/src/backend/utils/mb/wchar.c index f23732f..a213636 100644 --- a/src/backend/utils/mb/wchar.c +++ b/src/backend/utils/mb/wchar.c @@ -1336,6 +1336,264 @@ pg_utf8_islegal(const unsigned char *source, int length)/* *------------------------------------------------------------------- + * character incrementer + * + * These functions accept "charptr", a pointer to the first byte of a + * maybe-multibyte character. Try `increment' the character and return + * true if successed. If these functions returns false, the character + * should be untouched. These functions must be implemented in + * correspondence with verifiers, in other words, the rewrited + * character by this function must pass the check by pg_*_verifier() + * if returns true. + * ------------------------------------------------------------------- + */ + +#ifndef FRONTEND +static bool +pg_generic_charinc(unsigned char *charptr, int len) +{ + unsigned char *lastchar = (unsigned char *) (charptr + len - 1); + unsigned char savelastchar = *lastchar; + const char *const_charptr = (const char *)charptr; + + while (*lastchar < (unsigned char) 255) + { + (*lastchar)++; + if (!pg_verifymbstr(const_charptr, len, true)) + continue; + return true; + } + + *lastchar = savelastchar; + return false; +} + +/* + * Calculate the valid UTF-8 byte sequence that is the smallest of the + * greater than the given valid UTF-8 byte sequence for single + * character. The word `valid' here means that pg_utf8_islegal() says + * that it is valid. pg_utf8_increment does this roughly with the + * following algorithm. + * + * - If the length of the sequence is 1, the byte is between 0x01 and + * 0x7f. It is simplly incremented if less than 0x7f. + * + * - Otherwise the bytes after the first one ranges between 0x80 and + * 0xbf. The sequence is incremented from the last byte and the + * carry is propagated up to the first byte. + * + * Only the UTF-8 byte sequences corresponding to the following + * unicode point which have no character allocated make this function + * to fail in spite of its validity. + * + * U+0007F, U+007FF, U+0FFFF, U+10FFFF + * + */ +static bool +pg_utf8_increment(unsigned char *charptr, int length) +{ + unsigned char a; + unsigned char bak[4]; + unsigned char limit; + + switch (length) + { + default: + /* reject lengths 5 and 6 for now */ + return false; + case 4: + bak[3] = charptr[3]; + a = charptr[3]; + if (a < 0xBF) + { + charptr[3]++; + break; + } + charptr[3] = 0x80; + /* FALL THRU */ + case 3: + bak[2] = charptr[2]; + a = charptr[2]; + if (a < 0xBF) + { + charptr[2]++; + break; + } + charptr[2] = 0x80; + /* FALL THRU */ + case 2: + bak[1] = charptr[1]; + a = charptr[1]; + switch (*charptr) + { + case 0xED: + limit = 0x9F; + break; + + case 0xF4: + limit = 0x8F; + break; + + default: + limit = 0xBF; + + } + + if (a < limit) + { + charptr[1]++; + break; + } + + charptr[1] = 0x80; + /* FALL THRU */ + case 1: + bak[0] = *charptr; + a = *charptr; + if (a == 0x7F || a == 0xDF || a == 0xEF || a == 0xF4) + { + /* Rewinding modified bytes and return fail. length is + * confirmed to be between 1 and 4 here. 
*/ + memcpy(charptr, bak, length); + return false; + } + charptr[0]++; + break; + } + + return true; +} + +/* Calculate the valid EUC-JP byte sequence that is the smallest of + * the greater than the given valid EUC-JP byte sequence for single + * character. The word `valid' here means that pg_eucjp_verifier() + * syas that it is valid. This function does this roughly with the + * following algorithm. + * + * - If the sequence starts with SS2(0x8e), it must be a two-byte + * sequence representing JIS X 0201 characters with the second byte + * ranges between 0xa1 and 0xde. It is simplly incremented at the + * second byte if less than 0xde, and otherwise rewrite whole the + * sequence to 0xa1 0xa1 as the result. + * + * - If the sequence starts with SS3(0x8f), it must be a three-byte + * sequence which the last two bytes ranges between 0xa1 and 0xfe. + * The bytes are incremented from the last within the range and + * propagate the carry up to the first byte. + * + * - If the sequence starts with the values other than the aboves and + * its MSB is set, it must be a two-byte sequence representing JIS + * X 0208 characters with both bytes ranges between 0xa1 and 0xfe. + * The bytes are incremented from the latter within the range and + * propagate the carry up to the first byte. + * + * - Otherwise the sequence is consists of single byte representing + * ASCII characters. It is incremented up to 0x7f. + * + * Only three EUC-JP byte sequences shown below which has no character + * allocated make this function to fail in spite of its validity. + * + * 0x7f, 0xfe 0xfe, 0x8f 0xfe 0xfe + * + */ +static bool +pg_eucjp_increment(unsigned char *charptr, int length) +{ + unsigned char bak[3]; + unsigned char c1, c2; + signed int i; + + c1 = *charptr; + + switch (c1) + { + case SS2: /* JIS X 0201 */ + if (length != 2) return false; + + c2 = charptr[1]; + + if (c2 > 0xde) + charptr[0] = charptr[1] = 0xa1; + else if (c2 < 0xa1) + charptr[1] = 0xa1; + else + charptr[1]++; + + break; + + case SS3: /* JIS X 0212 */ + if (length != 3) return false; + + for (i = 2 ; i > 0 ; i--) + { + bak[i] = charptr[i]; + c2 = charptr[i]; + if (c2 < 0xa1) + { + charptr[i] = 0xa1; + return true; + } + else if (c2 < 0xfe) + { + charptr[i]++; + break; + } + charptr[i] = 0xa1; + } + + + if (i == 0) /* Out of 3-byte code region */ + { + charptr[1] = bak[1]; + charptr[2] = bak[2]; + return false; + } + + break; + + default: + if (IS_HIGHBIT_SET(c1)) /* JIS X 0208? 
*/ + { + if (length != 2) return false; + + for (i = 1 ; i >= 0 ; i--) /* i must be signed */ + { + bak[i] = charptr[i]; + c2 = charptr[i]; + if (c2 < 0xa1) + { + charptr[i] = 0xa1; + return true; + } + else if (c2 < 0xfe) + { + charptr[i]++; + break; + } + charptr[i] = 0xa1; + } + + if (i < 0) /* Out of 2 byte code region */ + { + charptr[0] = bak[0]; + charptr[1] = bak[1]; + return false; + } + } + else + { /* ASCII, single byte */ + if (c1 > 0x7e) + return false; + (*charptr)++; + } + } + + return true; +} +#endif + +/* + *------------------------------------------------------------------- * encoding info table * XXX must be sorted by thesame order as enum pg_enc (in mb/pg_wchar.h) *------------------------------------------------------------------- @@ -1459,6 +1717,25 @@ pg_database_encoding_max_length(void)}/* + * give the character incrementer for the encoding for the current database + */ +character_incrementer +pg_database_encoding_character_incrementer(void) +{ + switch (GetDatabaseEncoding()) + { + case PG_UTF8: + return pg_utf8_increment; + + case PG_EUC_JP: + return pg_eucjp_increment; + + default: + return pg_generic_charinc; + } +} + +/* * Verify mbstr to make sure that it is validly encoded in the current * database encoding. Otherwise same as pg_verify_mbstr().*/ diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h index 826c7af..728175c 100644 --- a/src/include/mb/pg_wchar.h +++ b/src/include/mb/pg_wchar.h @@ -284,6 +284,8 @@ typedef int (*mblen_converter) (const unsigned char *mbstr);typedef int (*mbdisplaylen_converter) (constunsigned char *mbstr); +typedef bool (*character_incrementer) (unsigned char *mbstr, int len); +typedef int (*mbverifier) (const unsigned char *mbstr, int len);typedef struct @@ -389,6 +391,7 @@ extern int pg_encoding_mbcliplen(int encoding, const char *mbstr,extern int pg_mbcharcliplen(constchar *mbstr, int len, int imit);extern int pg_encoding_max_length(int encoding);extern int pg_database_encoding_max_length(void); +extern character_incrementer pg_database_encoding_character_incrementer(void);extern int PrepareClientEncoding(int encoding);externint SetClientEncoding(int encoding);
On Wed, Oct 12, 2011 at 11:45 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@oss.ntt.co.jp> wrote:
> Hello, the work is finished.
>
> Version 4 of the patch is attached to this message.

I went through this in a bit more detail tonight and am cleaning it up. But I'm a bit confused, looking at pg_utf8_increment() in detail:

- Why does the second byte need special handling for 0xED and 0xF4? AFAICT, UTF-8 requires all legal strings to have a second byte between 0x80 and 0xBF, just as in byte positions 3 and 4, so these bytes would be invalid in this position anyway.

- In the first byte, we don't increment if the current value for that byte is 0x7F, 0xDF, 0xEF, or 0xF4. But why isn't it 0xF7 rather than 0xF4? I see there's a comparable restriction in pg_utf8_islegal(), but I don't understand why.

- Perhaps on the same point, the comments claim that we will fail for code points U+0007F, U+007FF, U+0FFFF, and U+10FFFF. But IIUC, a 4-byte unicode character can encode values up to U+1FFFFF, so why is it U+10FFFF rather than U+1FFFFF?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > - Why does the second byte need special handling for 0xED and 0xF4? http://www.faqs.org/rfcs/rfc3629.html See section 4 in particular. The underlying requirement is to disallow multiple representations of the same Unicode code point. regards, tom lane
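For reference, the byte-sequence syntax from section 4 of RFC 3629, which is exactly where the 0xED and 0xF4 limits come from:

UTF8-char  = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
UTF8-1     = %x00-7F
UTF8-2     = %xC2-DF UTF8-tail
UTF8-3     = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
             %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
UTF8-4     = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
             %xF4 %x80-8F 2( UTF8-tail )
UTF8-tail  = %x80-BF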
On Mon, Oct 17, 2011 at 11:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> - Why does the second byte need special handling for 0xED and 0xF4? > > http://www.faqs.org/rfcs/rfc3629.html > > See section 4 in particular. The underlying requirement is to disallow > multiple representations of the same Unicode code point. I'm still confused. The input string is already known to be valid UTF-8, so the second byte (if there is one) must be between 0x80 and 0xBF. Therefore it will be neither 0xED nor 0xF4. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Oct 17, 2011 at 11:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> http://www.faqs.org/rfcs/rfc3629.html > I'm still confused. The input string is already known to be valid > UTF-8, so the second byte (if there is one) must be between 0x80 and > 0xBF. Therefore it will be neither 0xED nor 0xF4. I haven't read the patch lately, but ED and F4 are special as *first* bytes. Maybe the logic isn't quite right, or you read it wrong? regards, tom lane
On Tue, Oct 18, 2011 at 12:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Mon, Oct 17, 2011 at 11:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> http://www.faqs.org/rfcs/rfc3629.html > >> I'm still confused. The input string is already known to be valid >> UTF-8, so the second byte (if there is one) must be between 0x80 and >> 0xBF. Therefore it will be neither 0xED nor 0xF4. > > I haven't read the patch lately, but ED and F4 are special as *first* > bytes. Maybe the logic isn't quite right, or you read it wrong? I think I'll let the patch author comment on that. It looks wrong to me, but I just work here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [v9.2] make_greater_string() does not return a string in some cases
From: Kyotaro HORIGUCHI
Hello,

> > Robert Haas <robertmhaas@gmail.com> writes:
> >> - Why does the second byte need special handling for 0xED and 0xF4?
> >
> > http://www.faqs.org/rfcs/rfc3629.html
> >
> > See section 4 in particular. The underlying requirement is to disallow
> > multiple representations of the same Unicode code point.

The special handling skips the UTF-8 code regions corresponding, in UCS-4, to U+D800 - U+DFFF and to the code points above U+10FFFF. The former region is reserved for use with the UTF-16 encoding form as surrogate pairs and does not directly represent characters, as described in section 3 of RFC 3629. The latter region is outside the range of UTF-8 by the definition given in the same section.

former> The definition of UTF-8 prohibits encoding character
former> numbers between U+D800 and U+DFFF, which are reserved for
former> use with the UTF-16 encoding form (as surrogate pairs)
former> and do not directly represent characters.

latter> In UTF-8, characters from the U+0000..U+10FFFF range (the
latter> UTF-16 accessible range) are encoded using sequences of 1
latter> to 4 octets.

# However, I wrote this exception at the beginning simply by mimicking
# pg_utf8_islegal()'s behavior.

This must be the basis of the behavior of pg_utf8_verifier(), and pg_utf8_increment() has taken it over. It may be good to describe this origin of the special handling in a comment on these functions to avoid this sort of confusion.

> I'm still confused. The input string is already known to be valid
> UTF-8, so the second byte (if there is one) must be between 0x80 and
> 0xBF. Therefore it will be neither 0xED nor 0xF4.

--
Kyotaro Horiguchi
NTT Open Source Software Center
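To make those two excluded regions concrete, here is a throwaway program (an illustration only, not part of the patch) that decodes the first sequences ruled out by the 0xED and 0xF4 second-byte limits:

#include <stdio.h>

/* Decode a 3- or 4-byte UTF-8 sequence into a code point.  No validation
 * is done; this exists only to show where the excluded regions begin. */
static unsigned int
utf8_codepoint(const unsigned char *s, int len)
{
	if (len == 3)
		return ((s[0] & 0x0F) << 12) | ((s[1] & 0x3F) << 6) | (s[2] & 0x3F);
	return ((s[0] & 0x07) << 18) | ((s[1] & 0x3F) << 12) |
		   ((s[2] & 0x3F) << 6) | (s[3] & 0x3F);
}

int
main(void)
{
	/* first sequence past the 0xED second-byte limit of 0x9F */
	unsigned char surr[3] = {0xED, 0xA0, 0x80};
	/* first sequence past the 0xF4 second-byte limit of 0x8F */
	unsigned char high[4] = {0xF4, 0x90, 0x80, 0x80};

	printf("ED A0 80    -> U+%04X\n", utf8_codepoint(surr, 3));	/* U+D800, first surrogate */
	printf("F4 90 80 80 -> U+%06X\n", utf8_codepoint(high, 4));	/* U+110000, past U+10FFFF */
	return 0;
}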
On Thu, Oct 20, 2011 at 9:36 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@oss.ntt.co.jp> wrote: > This must be the basis of the behavior of pg_utf8_verifier(), and > pg_utf8_increment() has taken over it. It may be good to describe > this origin of the special handling as comment of these functions > to avoid this sort of confusion. Oh, you know what? I'm misreading this code. *facepalm* I thought that code was looking for 0xED/0xF4 in the second position, but it's actually looking for them in the first position, which makes vastly more sense. Whee! -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [v9.2] make_greater_string() does not return a string in some cases
From: horiguchi.kyotaro@oss.ntt.co.jp
Hello, I feel at a loss what to do...

> I thought that code was looking for 0xED/0xF4 in the second position,
> but it's actually looking for them in the first position, which makes
> vastly more sense. Whee!

Anyway, I will try to describe another aspect of this code for the present.

The switch block in pg_utf8_increment() folds together five individual manipulations chosen by the byte length of the sequence. The separation presupposes that the input bytes and length form a valid UTF-8 sequence.

For a character of more than 4 bytes, it returns false.

For 4 bytes, the sequence ranges between U+10000 and U+1FFFFF. If charptr[3] is less than 0xBF, increment it and return true. Else assign 0x80 to charptr[3], and then if charptr[2] is less than 0xBF, increment it and return true. Else assign 0x80 to charptr[2], and then, if (charptr[1] is less than 0x8F when charptr[0] == 0xF4) or (charptr[1] is less than 0xBF when charptr[0] != 0xF4), increment it and return true. Else assign 0x80 to charptr[1], and then if charptr[0] is not 0xF4, increment it and return true. Else the input sequence must be 0xF4 0x8F 0xBF 0xBF, which represents U+10FFFF, the upper limit of the UTF-8 representation. Restore the sequence and return false.

For 3 bytes, the sequence ranges between U+800 and U+FFFF. If charptr[2] is less than 0xBF, increment it and return true. Else assign 0x80 to charptr[2], and then, if (charptr[1] is less than 0x9F when charptr[0] == 0xED) or (charptr[1] is less than 0xBF when charptr[0] != 0xED), increment it and return true. The sequence 0xED 0x9F 0xBF, which represents U+D7FF, is incremented to 0xEE 0x80 0x80 (U+E000) this way. Else assign 0x80 to charptr[1], and then if charptr[0] is not 0xEF, increment it and return true. Else the input sequence must be 0xEF 0xBF 0xBF, which represents U+FFFF, and the next UTF-8 sequence has a length of 4. Restore the sequence and return false.

For 2 bytes, the sequence ranges between U+80 and U+7FF. If charptr[1] is less than 0xBF, increment it and return true. Else assign 0x80 to charptr[1], and then if charptr[0] is not 0xDF, increment it and return true. Else the input sequence must be 0xDF 0xBF, which represents U+7FF, and the next UTF-8 sequence has a length of 3. Restore the sequence and return false.

For 1 byte, the byte ranges between U+0 and U+7F. If charptr[0] is less than 0x7F, increment it and return true. Else the input must be 0x7F, which represents U+7F, and the next UTF-8 sequence has a length of 2. Restore the sequence and return false.

--
Kyotaro Horiguchi
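As a cross-check of the walkthrough above, a small harness can exercise the boundary sequences it mentions. A sketch, assuming pg_utf8_increment() from the patch has been exposed to the test file (it is static in wchar.c as committed):

#include <stdio.h>
#include <stdbool.h>

extern bool pg_utf8_increment(unsigned char *charptr, int length);

static void
try_inc(unsigned char *seq, int len)
{
	int		i;
	bool	ok = pg_utf8_increment(seq, len);

	printf("%s:", ok ? "ok  " : "fail");
	for (i = 0; i < len; i++)
		printf(" %02X", seq[i]);
	printf("\n");
}

int
main(void)
{
	unsigned char a[] = {0xED, 0x9F, 0xBF};			/* U+D7FF  -> EE 80 80 */
	unsigned char b[] = {0x7F};						/* U+007F  -> fail     */
	unsigned char c[] = {0xEF, 0xBF, 0xBF};			/* U+FFFF  -> fail     */
	unsigned char d[] = {0xF4, 0x8F, 0xBF, 0xBF};	/* U+10FFFF -> fail    */

	try_inc(a, 3);
	try_inc(b, 1);
	try_inc(c, 3);
	try_inc(d, 4);
	return 0;
}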
On Sat, Oct 29, 2011 at 1:16 PM, <horiguchi.kyotaro@oss.ntt.co.jp> wrote:
> Hello, I feel at a loss what to do...
>
>> I thought that code was looking for 0xED/0xF4 in the second position,
>> but it's actually looking for them in the first position, which makes
>> vastly more sense. Whee!
>
> Anyway, I will try to describe another aspect of this code for the present.

I've committed this, after a good deal of hacking on the comments, some coding style cleanup, and one bug fix:

+		workstr[len] = '\0';

Without that, when we start fiddling with the second-to-last character, the last one is still hanging around, which is different than the old behavior.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
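The effect of that one-liner is easiest to see with a hypothetical single-byte miniature of the truncation loop (illustration only; the variable names mirror the patch's):

#include <stdio.h>

int
main(void)
{
	/* Hypothetical miniature of make_greater_string's truncation step. */
	char	workstr[3] = "ab";
	int		len = 2;
	int		charlen = 1;

	len -= charlen;			/* truncate the last character: logically "a" */
	workstr[len] = '\0';	/* the committed fix: cut off the stale byte  */
	workstr[len - 1]++;		/* next iteration increments 'a' to 'b'       */

	/* prints "b"; without the fix the stale 'b' would still be hanging
	 * around and this would print "bb" */
	printf("%s\n", workstr);
	return 0;
}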
Robert Haas <robertmhaas@gmail.com> writes: > I've committed this, after a good deal of hacking on the comments, > some coding style cleanup, and one bug fix: Ummm ... why do the incrementer functions think they need to restore the previous value on failure? AFAICS that's a waste of code and cycles, since there is only one caller and it doesn't care in the least. I'm also quite distressed that you ignored my advice to limit the number of combinations tried. This patch could be horribly slow when dealing with wide characters, eg think what will happen when starting from U+10000. regards, tom lane
On Sat, Oct 29, 2011 at 3:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> I've committed this, after a good deal of hacking on the comments, >> some coding style cleanup, and one bug fix: > > Ummm ... why do the incrementer functions think they need to restore the > previous value on failure? AFAICS that's a waste of code and cycles, > since there is only one caller and it doesn't care in the least. Well, it might not be strictly necessary for pg_utf8_increment() and pg_eucjp_increment(), but it's clearly necessary for the generic incrementer function for exactly the same reason it was needed in the old coding. I suppose we could weaken the rule to "you must leave a valid character behind rather than a bunch of bytes that doesn't encode to a character", but the cycle savings are negligible and the current rule seems both simpler and more bullet-proof. > I'm also quite distressed that you ignored my advice to limit the number > of combinations tried. This patch could be horribly slow when dealing > with wide characters, eg think what will happen when starting from > U+10000. Uh, I think it will try at most one character in that position and then truncate away that character entirely, per my last email on this topic (to which you never responded): http://archives.postgresql.org/pgsql-hackers/2011-09/msg01195.php -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Sat, Oct 29, 2011 at 3:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Ummm ... why do the incrementer functions think they need to restore the
>> previous value on failure? AFAICS that's a waste of code and cycles,
>> since there is only one caller and it doesn't care in the least.

> Well, it might not be strictly necessary for pg_utf8_increment() and
> pg_eucjp_increment(), but it's clearly necessary for the generic
> incrementer function for exactly the same reason it was needed in the
> old coding. I suppose we could weaken the rule to "you must leave a
> valid character behind rather than a bunch of bytes that doesn't
> encode to a character", but the cycle savings are negligible and the
> current rule seems both simpler and more bullet-proof.

No, it's *not* necessary any more, AFAICS. make_greater_string discards the data and overwrites it with a null instantly upon getting a failure return from the incrementer. The reason we used to need it was that we did pg_mbcliplen after failing to increment, but now we do that before we ever increment anything, so we already know the length of the last character. It doesn't matter whether those bytes are still valid or contain garbage.

>> I'm also quite distressed that you ignored my advice to limit the number
>> of combinations tried. This patch could be horribly slow when dealing
>> with wide characters, eg think what will happen when starting from
>> U+10000.

> Uh, I think it will try at most one character in that position and
> then truncate away that character entirely, per my last email on this
> topic (to which you never responded):
> http://archives.postgresql.org/pgsql-hackers/2011-09/msg01195.php

Oh! You are right, I was expecting it to try multiple characters at the same position before truncating the string. This change seems to have lobotomized things rather thoroughly. What is the rationale for that? As an example, when dealing with a single-character string, it will fail altogether if the next code value sorts out-of-order, so this seems to me to be a rather large step backwards.

I think we ought to go back to the previous design of incrementing till failure and then truncating, which puts the onus on the incrementer to make a reasonable tradeoff of how many combinations to try per character position. There's a simple tweak we could make to the patch to limit that: once we've maxed out a lower-order byte of a multibyte char, *don't reset it to minimum*, just move on to incrementing the next higher byte. This preserves the old property that the maximum number of combinations tried is bounded by 256 * string's length in bytes.

			regards, tom lane
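A sketch of that "max out, don't reset" tweak, as an illustration only (UTF-8 continuation bytes; the lead-byte limits of the real incrementer are ignored here):

#include <stdbool.h>

/* Sketch of the tweak: each call produces the next candidate by
 * incrementing continuation bytes from the end, but a byte that has
 * reached 0xBF stays there instead of wrapping back to 0x80, so the
 * number of successful calls is bounded by roughly 64 per byte. */
static bool
utf8_inc_no_reset(unsigned char *charptr, int length)
{
	int		i;

	for (i = length - 1; i > 0; i--)
	{
		if (charptr[i] < 0xBF)
		{
			charptr[i]++;
			return true;
		}
		/* maxed out: leave it at 0xBF and carry to the next higher byte */
	}
	return false;		/* caller gives up and truncates this character */
}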
On Sat, Oct 29, 2011 at 4:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Well, it might not be strictly necessary for pg_utf8_increment() and >> pg_eucjp_increment(), but it's clearly necessary for the generic >> incrementer function for exactly the same reason it was needed in the >> old coding. I suppose we could weaken the rule to "you must leave a >> valid character behind rather than a bunch of bytes that doesn't >> encode to a character", but the cycle savings are negligible and the >> current rule seems both simpler and more bullet-proof. > > No, it's *not* necessary any more, AFAICS. make_greater_string discards > the data and overwrites it with a null instantly upon getting a failure > return from the incrementer. The reason we used to need it was that we > did pg_mbcliplen after failing to increment, but now we do that before > we ever increment anything, so we already know the length of the last > character. It doesn't matter whether those bytes are still valid or > contain garbage. Oh, dude. I think you are right. I guess we can rip that crap out then. That's much cleaner. >>> I'm also quite distressed that you ignored my advice to limit the number >>> of combinations tried. This patch could be horribly slow when dealing >>> with wide characters, eg think what will happen when starting from >>> U+10000. > >> Uh, I think it will try at most one character in that position and >> then truncate away that character entirely, per my last email on this >> topic (to which you never responded): >> http://archives.postgresql.org/pgsql-hackers/2011-09/msg01195.php > > Oh! You are right, I was expecting it to try multiple characters at the > same position before truncating the string. This change seems to have > lobotomized things rather thoroughly. What is the rationale for that? > As an example, when dealing with a single-character string, it will fail > altogether if the next code value sorts out-of-order, so this seems to > me to be a rather large step backwards. > > I think we ought to go back to the previous design of incrementing till > failure and then truncating, which puts the onus on the incrementer to > make a reasonable tradeoff of how many combinations to try per character > position. There's a simple tweak we could make to the patch to limit > that: once we've maxed out a lower-order byte of a multibyte char, > *don't reset it to minimum*, just move on to incrementing the next > higher byte. This preserves the old property that the maximum number of > combinations tried is bounded by 256 * string's length in bytes. On this point I believe you are still confused. The old code tried one character per position, and the new code tries one character per position. Nothing has been lobotomized in any way. The difference is that the old code used a "guess and check" approach to generate the character, so there was an inner loop that was trying to generate a character (possibly generating various garbage strings that did not represent a character along the way) and then, upon success, checked the sort order of that single string before truncating and retrying. The new code does exactly the same thing in the outer loop - i.e. truncate one character per iteration - but the inner loop has, at least for UTF-8 and EUC-JP, been replaced with an algorithm that is guaranteed to produce a valid character without needing to loop. Now having said that, I think there is a possibility for some improvement here. 
If we know we're not going to spend a lot of time uselessly screwing around trying to get something that will pass pg_verifymbstr(), then we could probably afford to call ltproc several times per position, rather than just once. But that's not restoring the behavior of the old algorithm; that's improving on the old algorithm. And before we do that, we need to think about a couple of things: first, that silly looping behavior is still there for anything other than UTF-8 and EUC-JP. Until we have constant-time increment functions for every encoding we support, we probably don't want to get too jiggy with it; second, we need to convince ourselves that this will succeed in a meaningful number of cases where the current algorithm fails. I played around a bit with the UTF-8 case (with collation = en_US.UTF-8) before committing this and I suspect that trying 4 or 5 characters per position could be a win - you might for example be looking at something like an accented E, and you might have several different versions of E in a row before you get to something that's no longer E-like. If you get to that point and still don't find something that compares favorably, it's probably time to throw in the towel. Another idea would be to do the first increment by one, and then increment by two, four, eight before giving up, or something like that. It probably needs some research, and frankly I'm happy to leave it to someone who is having a real-world problem with it. The fact that we haven't gotten any complaints before suggests that this actually works decently well as it stands. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
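A sketch of the doubling-step idea from the preceding paragraph, purely hypothetical: charinc() stands for a constant-time incrementer and str_is_greater() for the ltproc comparison done in make_greater_string().

#include <stdbool.h>

/* Hypothetical probing loop: try successors at increasing distances
 * (1, 2, 4, 8 increments) from the current character before giving up
 * on this position and letting the caller truncate it. */
static bool
try_position(unsigned char *lastchar, int charlen,
			 bool (*charinc) (unsigned char *, int),
			 bool (*str_is_greater) (void))
{
	int		step;

	for (step = 1; step <= 8; step *= 2)
	{
		int		i;

		for (i = 0; i < step; i++)
		{
			if (!charinc(lastchar, charlen))
				return false;	/* ran off the end of the code space */
		}
		if (str_is_greater())
			return true;		/* usable greater string found */
	}
	return false;				/* give up; caller truncates instead */
}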
Robert Haas <robertmhaas@gmail.com> writes: > On Sat, Oct 29, 2011 at 4:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Oh! You are right, I was expecting it to try multiple characters at the >> same position before truncating the string. This change seems to have >> lobotomized things rather thoroughly. What is the rationale for that? >> As an example, when dealing with a single-character string, it will fail >> altogether if the next code value sorts out-of-order, so this seems to >> me to be a rather large step backwards. > On this point I believe you are still confused. The old code tried > one character per position, and the new code tries one character per > position. Nothing has been lobotomized in any way. No, on this point you are flat out wrong. Try something like select ... where f1 like 'p%'; in tt_RU locale, wherein 'q' sorts between 'k' and 'l'. The old code correctly found that 'r' works as a string greater than 'p'. The new code fails to find a greater string, because it only tries 'q' and then gives up. This results in a selectivity estimate much poorer than necessary. Since the stated purpose of this patch is to fix some corner cases where the code fails to find a greater string, I fail to see why it's acceptable to introduce some other corner cases that weren't there before. > The difference is > that the old code used a "guess and check" approach to generate the > character, so there was an inner loop that was trying to generate a > character (possibly generating various garbage strings that did not > represent a character along the way) and then, upon success, checked > the sort order of that single string before truncating and retrying. You are misreading the old code. The inner loop increments the last byte, checks for success, and if it hasn't produced a greater string then it loops around to increment again. > The fact that we haven't gotten any complaints before suggests that this > actually works decently well as it stands. Well, that's true of the old algorithm ;-) I had likewise thought of the idea of trying some fixed number of character values at each position, but it's unclear to me why that's better than allowing an encoding-specific behavior. I don't believe that we could get away with trying less than a few dozen values, though. For example, in a situation where case sensitivity is relevant, you might need to increment past all the upper-case letters to get to a suitable lower-case letter. I also think that it's probably useful to try incrementing higher-order bytes of a multibyte character before giving up --- we just can't afford to do an exhaustive search. Thus my proposal to let the low-order bytes max out but not cycle. regards, tom lane
On Sun, Oct 30, 2011 at 10:58 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > You are misreading the old code. The inner loop increments the last > byte, checks for success, and if it hasn't produced a greater string > then it loops around to increment again. Ugh. You're right. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company