Re: [v9.2] make_greater_string() does not return a string in some cases - Mailing list pgsql-hackers

From Kyotaro HORIGUCHI
Subject Re: [v9.2] make_greater_string() does not return a string in some cases
Date
Msg-id 20110922.203029.185577421.horiguchi.kyotaro@oss.ntt.co.jp
In response to Re: [v9.2] make_greater_string() does not return a string in some cases  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Thank you for your understanding on that point.

At Wed, 21 Sep 2011 20:35:02 -0400, Robert Haas <robertmhaas@gmail.com> wrote
> ...while Kyotaro Horiguchi clearly feels otherwise, citing the
> statistic that about 100 out of 7000 Japanese characters fail to work
> properly:
> 
> http://archives.postgresql.org/pgsql-bugs/2011-07/msg00064.php
> 
> That statistic seems to justify some action, but what?  Ideas:

To add to those figures: they are based on the whole set of
characters defined in JIS X 0208, which has traditionally been
used for information exchange in Japan (though it is becoming
history now). Narrowing this to the commonly used characters
(called `Jouyou-Kanji' in Japanese, which high school graduates
in Japan are expected to have learned), 35 out of 2100 fail.

# On the other hand, for the wider set JIS X 0213, which is
# roughly compatible with Unicode and defines more than 12K
# characters, I have not counted, but the additional 5K
# characters can be assumed to be less likely to fail than the
# characters in JIS X 0208.


> 1. Adopt the patch as proposed, or something like it.
> 2. Instead of installing encoding-specific character incrementing
> functions, we could try to come up with a more reliable generic
> algorithm.  Not sure exactly what, though.
> 3. Come up with some way to avoid needing to do this in the first place.
> 
> One random idea I have is - instead of generating > and < clauses,
> could we define a "prefix match" operator - i.e. a ### b iff substr(a,
> 1, length(b)) = b?  We'd need to do something about the selectivity,
> but I don't see why that would be a problem.
> 
> Thoughts?

I am a newbie to PostgreSQL, but from a general point of view, I
think the most radical and cleanest way to fix this behavior is
to give indexes their own forward-matching (prefix-matching)
function for strings, ignoring possible overheads that I am not
aware of.  This would cover all the failures that this patch
leaves unfixed, assuming that the `greater string' does not need
to be a `valid string' when it is used only for searching a
btree.
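
To illustrate the assumption above, here is a minimal sketch of a
`greater string' built without regard to character encoding. The
function name is hypothetical and this is not PostgreSQL code. It
assumes that the index compares strings as raw unsigned bytes (as
with the C locale); under other collations the ordering differs
and this bound would not be reliable.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of PostgreSQL): rewrite the prefix in
 * buf[0..len) in place so that it becomes an upper bound for every byte
 * string starting with the original prefix, under unsigned bytewise
 * comparison.  The result need not be valid in any character encoding.
 *
 * Returns the length of the bound, or 0 if no bound exists (the prefix
 * is empty or consists only of 0xFF bytes).
 */
size_t
bytewise_greater_string(unsigned char *buf, size_t len)
{
    assert(buf != NULL);

    while (len > 0)
    {
        if (buf[len - 1] < 0xFF)
        {
            /* Increment the last byte that can still be incremented. */
            buf[len - 1]++;
            return len;
        }
        /* A trailing 0xFF cannot be incremented: drop it and carry. */
        len--;
    }
    return 0;
}
```

Because no character boundaries are consulted, this cannot fail for
a particular character the way an encoding-aware increment can, but
the resulting bytes may not form a valid string.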

Another idea I can think of is to add a new operator that means
"check whether the string value is smaller than the `greater
string' of the parameter".  This operator could also defer making
the `greater string' until just before searching the btree,
summing up histogram entries, or comparing with column values.
If the assumption above is true, the "make greater string"
operation can be done regardless of character encoding.  This
seems to have a smaller impact than the "prefix match" operator.
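
As a rough comparison of the two operator ideas, the sketch below
writes both as predicates on NUL-terminated byte strings. The
function names are hypothetical, the comparison is bytewise (not
locale-aware), and substr() is taken in bytes rather than
characters, so this is only an approximation of what the SQL-level
operators would do.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * "Prefix match" as defined in the quoted mail:
 * a ### b iff substr(a, 1, length(b)) = b  (here: lengths in bytes).
 */
bool
prefix_match(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strncmp(a, b, strlen(b)) == 0;
}

/*
 * Range form: b <= a < upper, where `upper` is the `greater string'
 * of b.  The caller supplies `upper`, so it can be computed as late as
 * needed, for example just before the btree search.
 */
bool
in_prefix_range(const char *a, const char *b, const char *upper)
{
    assert(a != NULL && b != NULL && upper != NULL);
    return strcmp(a, b) >= 0 && strcmp(a, upper) < 0;
}
```

With bytewise ordering and upper = "abd" for the prefix "abc", both
predicates accept "abcdef" and reject "abd".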

# But, hmm, the more I investigate, the less difference there
# seems to be between the two... Anyway, this is beyond my
# knowledge for now. I need to study more.



On the other hand, if no additional encoding-specific `character
increment functions' are going to be added, the modification of
pg_wchar_table can be dropped, and make_greater_string can
select the `character increment function' with something like
'switch (GetDatabaseEncoding()) { case PG_UTF8:.. }'.  This also
gets rid of the pg_generic_charinc tweak for libpq.
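
The switch-based selection could look roughly like the following
self-contained sketch. All names here (sketch_encoding,
sketch_utf8_charinc and so on) are hypothetical stand-ins, not the
functions in the patch, and the UTF-8 incrementer is deliberately
simplified: it assumes a valid input character and gives up instead
of carrying into the preceding bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the server's encoding IDs; for this sketch only. */
typedef enum { SKETCH_UTF8, SKETCH_OTHER } sketch_encoding;

typedef bool (*charinc_fn) (unsigned char *charptr, int len);

/* Generic: bump the last byte with no regard for encoding validity. */
bool
sketch_generic_charinc(unsigned char *charptr, int len)
{
    assert(charptr != NULL && len > 0);
    if (charptr[len - 1] == 0xFF)
        return false;
    charptr[len - 1]++;
    return true;
}

/*
 * Simplified UTF-8: bump the last byte only while the result is still a
 * valid character of the same length (ASCII up to 0x7F, continuation
 * bytes up to 0xBF).  Returns false where a carry would be needed.
 */
bool
sketch_utf8_charinc(unsigned char *charptr, int len)
{
    unsigned char limit;

    assert(charptr != NULL && len > 0);
    limit = (len == 1) ? 0x7F : 0xBF;
    if (charptr[len - 1] >= limit)
        return false;
    charptr[len - 1]++;
    return true;
}

/* Pick the incrementer inside the caller, without a per-encoding table. */
charinc_fn
select_charinc(sketch_encoding enc)
{
    switch (enc)
    {
        case SKETCH_UTF8:
            return sketch_utf8_charinc;
        default:
            return sketch_generic_charinc;
    }
}
```

Keeping the selection local to the caller like this is what would
make the pg_wchar_table change unnecessary.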



At Wed, 21 Sep 2011 21:49:27 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote
> detail work; for instance, I noted an unconstrained memcpy into a 4-byte
> local buffer, as well as lots and lots of violations of PG house style.
> That's certainly all fixable but somebody will have to go through it.

Sorry that the patch does not follow the coding style. I will check it.


Regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

