Re: [v9.2] make_greater_string() does not return a string in some cases - Mailing list pgsql-hackers
From | Kyotaro HORIGUCHI |
---|---|
Subject | Re: [v9.2] make_greater_string() does not return a string in some cases |
Date | |
Msg-id | 20110922.203029.185577421.horiguchi.kyotaro@oss.ntt.co.jp |
In response to | Re: [v9.2] make_greater_string() does not return a string in some cases (Robert Haas <robertmhaas@gmail.com>) |
List | pgsql-hackers |
Thank you for your understanding on that point.

At Wed, 21 Sep 2011 20:35:02 -0400, Robert Haas <robertmhaas@gmail.com> wrote
> ...while Kyotaro Horiguchi clearly feels otherwise, citing the
> statistic that about 100 out of 7000 Japanese characters fail to work
> properly:
>
> http://archives.postgresql.org/pgsql-bugs/2011-07/msg00064.php
>
> That statistic seems to justify some action, but what? Ideas:

Some additions to those figures: they are based on the whole set of
characters defined in JIS X 0208, which has traditionally been used
for information exchange in Japan (though it is becoming history
now). Narrowing to the commonly used characters (called
`Jouyou-Kanji' in Japanese, which high school graduates in Japan are
expected to have learned), 35 out of 2100 are affected.

# On the other hand, widening to JIS X 0213, which is roughly
# compatible with Unicode and defines more than 12K chars, I
# have not counted, but the additional 5k characters can be
# assumed to be less likely to fail than the chars in JIS
# X 0208.

> 1. Adopt the patch as proposed, or something like it.
> 2. Instead of installing encoding-specific character incrementing
> functions, we could try to come up with a more reliable generic
> algorithm. Not sure exactly what, though.
> 3. Come up with some way to avoid needing to do this in the first place.
>
> One random idea I have is - instead of generating > and < clauses,
> could we define a "prefix match" operator - i.e. a ### b iff substr(a,
> 1, length(b)) = b? We'd need to do something about the selectivity,
> but I don't see why that would be a problem.
>
> Thoughts?

I am a newbie to PostgreSQL, but from a general point of view, I
think the most radical and clean way to fix this behavior is to give
indexes their own forward-matching (prefix-matching) function for
strings, ignoring possible overheads that I am not aware of. This
could cover all the failures that this patch leaves unfixed, assuming
that the `greater string' does not need to be a `valid string' when
it is used only for searching a btree.
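As a rough illustration of the "prefix match" semantics quoted above (a ### b iff substr(a, 1, length(b)) = b), here is a minimal sketch in C. The name `has_prefix` is hypothetical and is not a PostgreSQL function. The sketch assumes NUL-terminated strings in the same encoding and byte-wise equality, and it ignores collation and selectivity estimation entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper (not part of PostgreSQL): returns true iff the
 * first strlen(prefix) bytes of str are identical to prefix.
 * Assumes both arguments are NUL-terminated and use the same encoding.
 */
static bool
has_prefix(const char *str, const char *prefix)
{
    size_t plen;

    assert(str != NULL && prefix != NULL);
    plen = strlen(prefix);

    /*
     * strncmp stops at str's terminating NUL, so a str shorter than
     * prefix can never compare equal over plen bytes.
     */
    return strncmp(str, prefix, plen) == 0;
}
```

With a predicate of this kind, no `greater string' has to be constructed at all, so the question of whether an incremented character is valid in the database encoding does not arise.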

Another idea I can think of is to add a new operator that means
"examine whether the string value is smaller than the `greater
string' of the parameter". Such an operator could also defer making
the `greater string' until just before searching the btree, summing
up histogram entries, or comparing with column values. If the
assumption above holds, the "making greater string" operation can be
done regardless of character encoding. This seems to have a smaller
impact than a "prefix match" operator.

# But, hmm, the more I investigate, the less difference there seems
# to be between them... Anyway, this is beyond my knowledge for now.
# I need more study.

On the other hand, if no additional encoding-specific `character
increment function' is going to appear, the modification of
pg_wchar_table can be dropped, and make_greater_string can select the
`character increment function' with something like
'switch (GetDatabaseEncoding()) { case PG_UTF8:.. }'. This also gets
rid of the pg_generic_charinc tweak for libpq.

At Wed, 21 Sep 2011 21:49:27 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote
> detail work; for instance, I noted an unconstrained memcpy into a 4-byte
> local buffer, as well as lots and lots of violations of PG house style.
> That's certainly all fixable but somebody will have to go through it.

Sorry for the improper style of the patch. I will check it.

Regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
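
The switch-based selection of a `character increment function' mentioned above could look roughly like the following sketch. Everything here is a hypothetical stand-in: `sketch_encoding`, `charinc_fn`, `generic_charinc`, `utf8_charinc` and `select_charinc` are not PostgreSQL's definitions, and the UTF-8 routine is deliberately simplified. It only bumps the last byte of one valid character and reports failure instead of carrying into the earlier bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT PostgreSQL's definitions. */
typedef enum { SKETCH_ENC_UTF8, SKETCH_ENC_OTHER } sketch_encoding;
typedef bool (*charinc_fn) (unsigned char *ch, size_t len);

/* Generic fallback: bump the last byte with no knowledge of the encoding. */
static bool
generic_charinc(unsigned char *ch, size_t len)
{
    assert(ch != NULL && len > 0);
    if (ch[len - 1] == 0xFF)
        return false;           /* byte would overflow */
    ch[len - 1]++;
    return true;
}

/*
 * Simplified UTF-8 variant: ch must hold one valid character of len bytes.
 * Only the last byte is bumped, and only while it stays in its legal range
 * (0x00..0x7F for a single-byte character, 0x80..0xBF for a trailing byte).
 */
static bool
utf8_charinc(unsigned char *ch, size_t len)
{
    unsigned char limit;

    assert(ch != NULL && len > 0 && len <= 4);
    limit = (len == 1) ? 0x7F : 0xBF;
    if (ch[len - 1] >= limit)
        return false;           /* result would not be valid UTF-8 */
    ch[len - 1]++;
    return true;
}

/* Pick the increment function for an encoding, in the style of the switch. */
static charinc_fn
select_charinc(sketch_encoding enc)
{
    switch (enc)
    {
        case SKETCH_ENC_UTF8:
            return utf8_charinc;
        default:
            return generic_charinc;
    }
}
```

For the two-byte character 0xC3 0xBF, the generic routine produces 0xC3 0xC0, which is not valid UTF-8, while the UTF-8 routine returns false so that the caller can try something else. That difference is the kind of failure the encoding-specific function is meant to avoid.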