Re: BUG #15285: Query used index over field with ICU collation in some cases wrongly return 0 rows - Mailing list pgsql-bugs

From	Jehan-Guillaume de Rorthais
Subject	Re: BUG #15285: Query used index over field with ICU collation in some cases wrongly return 0 rows
Date	June 12, 2020 16:40:55
Msg-id	20200612184055.205f0159@firost
In response to	Re: BUG #15285: Query used index over field with ICU collation in some cases wrongly return 0 rows (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>)
Responses	Re: BUG #15285: Query used index over field with ICU collation in some cases wrongly return 0 rows
List	pgsql-bugs

On Wed, 10 Jun 2020 00:29:33 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
[...]
> After playing with ICU regression tests, I found functions ucol_strcollIter
> and ucol_nextSortKeyPart are safe. I'll do some performance tests and report
> here.

I did some benchmarks. See the attached script and its header to reproduce
them.

The script sorts 935895 French phrases, from 0 to 122 characters long with an
average length of 49. Performance tests were done on current master HEAD
(buggy) and with the attached patch, which relies on ucol_strcollIter.

My preliminary test with ucol_getSortKey was catastrophic, as we might
expect: 15-17x slower than the current HEAD. So I removed it from the actual
tests. I didn't try ucol_nextSortKeyPart though.

ucol_strcollIter is ~20% slower than HEAD on UTF8 databases, but this might be
acceptable. On LATIN1 databases the difference is about 1%. Here are the
numbers:

   DB Encoding   HEAD  strcollIter   ratio
   UTF8          2.74         3.27   1.19x
   LATIN1        5.34         5.40   1.01x
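
As a quick sanity check on the ratio column, here is a minimal Python sketch
that recomputes it from the two timing columns. It is not part of the attached
script; the dictionary layout and the function name `slowdown` are purely
illustrative, and the timings are copied as-is from the table (in whatever
unit the benchmark reports).

```python
# Timings copied from the table above (unit as reported by the benchmark).
timings = {
    "UTF8": {"head": 2.74, "strcoll_iter": 3.27},
    "LATIN1": {"head": 5.34, "strcoll_iter": 5.40},
}


def slowdown(head, patched):
    """Return (ratio, percent slower) of the patched timing relative to HEAD."""
    ratio = patched / head
    return ratio, (ratio - 1.0) * 100.0


for encoding, t in timings.items():
    ratio, pct = slowdown(t["head"], t["strcoll_iter"])
    print(f"{encoding:<8} {ratio:.2f}x  (+{pct:.1f}%)")
```

This gives 1.19x (+19.3%) for UTF8 and 1.01x (+1.1%) for LATIN1, matching the
table.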

I plan to add a regression test soon.

> In the meantime, I've been working on various workarounds. The only one I
> found is to use "fr-u-kr-latn-digit-kn" instead of "fr-u-kr-latn-digit".
> Unfortunately, the two collations are not equivalent, but I believe it might
> be useful in many cases.
> 
> I've been working on a second workaround: creating a type (a char variant for
> our use case), its operators and opfamily. All operators and support function
> 1 rely on ucol_getSortKey. Most of the workaround works well but surprisingly,
> the sort order is only enforced if the field is in the first position:
> 
>   * this works: "ORDER BY f1 COLLATE digitslast"
>   * this fails: "ORDER BY f2, f1 COLLATE digitslast"

I fixed this: I had not declared my opclass as the default for the type I
created. I'm not sure whether people would like to see or discuss this user
workaround here?
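
To illustrate why the two collations from the first workaround quoted above are
not equivalent, here is a toy Python sketch. It does not call ICU at all: the
two key functions are hypothetical stand-ins that only mimic, for simple
strings, "non-digits sort before digits" (the `kr-latn-digit` reordering) and
"runs of digits compare by numeric value" (what adding `kn` enables). Case,
accents and everything else ICU handles are ignored.

```python
import re


def key_digits_last(s):
    """Toy key: any non-digit sorts before any digit, digits compare as characters."""
    return [(1, ch) if ch.isdecimal() else (0, ch) for ch in s]


def key_digits_last_numeric(s):
    """Same toy key, but each run of digits is compared by its numeric value."""
    key = []
    for token in re.findall(r"\d+|\D", s):
        if token.isdecimal():
            key.append((1, int(token)))
        else:
            key.append((0, token))
    return key


words = ["a10", "a9", "ab", "a"]
print(sorted(words, key=key_digits_last))          # ['a', 'ab', 'a10', 'a9']
print(sorted(words, key=key_digits_last_numeric))  # ['a', 'ab', 'a9', 'a10']
```

The two orders differ as soon as values contain multi-digit numbers, which is
why the `kn` variant is only usable where numeric ordering of digit runs is
acceptable.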

Regards,
