Thread: BUG #15285: Query used index over field with ICU collation in some cases wrongly return 0 rows
From: Jehan-Guillaume de Rorthais
Date:
Hi,

I'm bumping this thread on pgsql-hackers; hopefully it will draw some more
opinions and discussion.

Should we try to fix this issue or not? This is clearly an upstream bug. It has
been reported, including regression tests, but nothing has moved for 2 years
now.

If we choose not to fix it on our side using e.g. a workaround (see patch), I
suppose this small bug should be documented somewhere so people are not left
alone in the wild.

Opinions?

Regards,

Begin forwarded message:

Date: Sat, 13 Jun 2020 00:43:22 +0200
From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
To: Thomas Munro <thomas.munro@gmail.com>, Peter Geoghegan <pg@bowt.ie>
Cc: Роман Литовченко <roman.lytovchenko@gmail.com>, PostgreSQL mailing lists <pgsql-bugs@lists.postgresql.org>
Subject: Re: BUG #15285: Query used index over field with ICU collation in some cases wrongly return 0 rows

On Fri, 12 Jun 2020 18:40:55 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

> On Wed, 10 Jun 2020 00:29:33 +0200
> Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
> [...]
> > After playing with ICU regression tests, I found functions ucol_strcollIter
> > and ucol_nextSortKeyPart are safe. I'll do some performance tests and report
> > here.
>
> I did some benchmarks. See attachment for the script and its header to
> reproduce.
>
> It sorts 935895 french phrases from 0 to 122 chars with an average of 49.
> Performance tests were done on current master HEAD (buggy) and using the patch
> in attachment, relying on ucol_strcollIter.
>
> My preliminary test with ucol_getSortKey was catastrophic, as we might
> expect. 15-17x slower than the current HEAD. So I removed it from actual
> tests. I didn't try with ucol_nextSortKeyPart though.
>
> Using ucol_strcollIter performs ~20% slower than HEAD on UTF8 databases, but
> this might be acceptable. Here are the numbers:
>
>   DB Encoding   HEAD   strcollIter   ratio
>   UTF8          2.74   3.27          1.19x
>   LATIN1        5.34   5.40          1.01x
>
> I plan to add a regression test soon.
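For readers who want to check the "ratio" column and the "~20% slower" figure in the quoted benchmark, here is a minimal Python sketch. The timings are copied from the quoted table; the message does not state their unit, and the names `timings` and `slowdown` are only illustrative, not part of the patch or the benchmark script.

```python
# Timings copied from the quoted benchmark table.
# The unit is not stated in the message, so none is assumed here.
timings = {
    "UTF8": {"head": 2.74, "strcoll_iter": 3.27},
    "LATIN1": {"head": 5.34, "strcoll_iter": 5.40},
}


def slowdown(head, patched):
    """Return patched / head: how many times slower the patched build is."""
    return patched / head


# Round to two decimals, matching the precision used in the table.
ratios = {
    enc: round(slowdown(t["head"], t["strcoll_iter"]), 2)
    for enc, t in timings.items()
}

for enc, ratio in ratios.items():
    print(f"{enc}: {ratio:.2f}x")
```

The UTF8 ratio comes out at about 1.19, which is where the rounded "~20% slower" comes from, while the LATIN1 ratio is about 1.01.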
Please find attached the second version of the patch, with a regression test.

Regards,
--
Jehan-Guillaume de Rorthais
Dalibo