Re: Explicit deterministic COLLATE fails with pattern matchingoperations on column with non-deterministic collation - Mailing list pgsql-bugs

From James Lucas
Subject Re: Explicit deterministic COLLATE fails with pattern matchingoperations on column with non-deterministic collation
Date
Msg-id CAAFmbbPn6B+jMJ-DTkMnNoKkJtgx_QTDpL-KqXKsGjsFA0936w@mail.gmail.com
Whole thread Raw
In response to Re: Explicit deterministic COLLATE fails with pattern matching operations on column with non-deterministic collation  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Explicit deterministic COLLATE fails with pattern matching operations on column with non-deterministic collation  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
Thanks Tom,

This is too much into the guts of the planner for me to contribute
much.  This does raise an interesting point that I had not considered
though - it sounds like column statistics depend on the default
collation on the column.  That could impact the plans chosen for
future queries, even if those queries are performed using a different
collation.  For most deterministic collations I expect that probably
doesn't make much of a difference, but for non-deterministic
collations it seems like the difference in stats could be significant.
N_distinct, in particular, seems like it might be very different for a
non-deterministic collation.

I tried setting up a pathological test case for this, and it seems
like at least currently, even with a non-deterministic collation
statistics still count values as distinct, even if the default
collation would consider them equivalent.  Not sure if that's as
intended or not?

create table stest (id numeric, t text);
create table stestnd (id numeric, t text collate mycollation);
insert into stest select generate_series(1,50000),'aaa';
insert into stest select generate_series(50001,100000),'aAa';
insert into stest select generate_series(100001,150000),'bbb';
insert into stest select generate_series(150001,200000),'bBb';
insert into stest select generate_series(200001,250000),'ccc';
insert into stest select generate_series(250001,300000),'cCc';
insert into stestnd select * from stest;
analyze stest, stestnd;

select schemaname, tablename, attname, n_distinct, most_common_vals
from pg_stats where attname='t' and tablename like 'stest%' order by
tablename;
 schemaname | tablename | attname | n_distinct |     most_common_vals
------------+-----------+---------+------------+---------------------------
 public     | stest     | t       |          6 | {aAa,cCc,bbb,aaa,ccc,bBb}
 public     | stestnd   | t       |          6 | {bBb,ccc,bbb,aAa,cCc,aaa}


Actually it turns out the DISTINCT clause doesn't either:

select count(*) from (select distinct t from stest) s;
 count
-------
     6
select count(*) from (select distinct t from stestnd) s;
 count
-------
     6

Sorry - don't want to derail the question at hand too much.  It seems
like it might be relevant if the discussion is around stats and
collation handling.

Thanks,
James

On Wed, May 27, 2020 at 7:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> James Lucas <jlucasdba@gmail.com> writes:
> > explain select * from ctestnd where t like 'a%' collate "C";
> > ERROR:  nondeterministic collations are not supported for LIKE
>
> Yeah.  I traced through this, and the place where it's failing is where
> the planner tries to apply the LIKE operator to the stored MCV values
> (to see how many of them pass the condition, which gives us a big clue
> about the selectivity).  Unfortunately, per the comments in selfuncs.c,
>
>  * For both oprrest and oprjoin functions, the operator's input collation OID
>  * (if any) is passed using the standard fmgr mechanism, so that the estimator
>  * function can fetch it with PG_GET_COLLATION().  Note, however, that all
>  * statistics in pg_statistic are currently built using the relevant column's
>  * collation.  Thus, in most cases where we are looking at statistics, we
>  * should ignore the operator collation and use the stats entry's collation.
>  * We expect that the error induced by doing this is usually not large enough
>  * to justify complicating matters.  In any case, doing otherwise would yield
>  * entirely garbage results for ordered stats data such as histograms.
>
> mcv_selectivity is following this advice and applying LIKE with the
> ctestnd.t column's declared collation ... and then the operator throws
> an error.
>
> The idea that using the "wrong" collation might actually cause an error
> was not factored into this design, obviously.  I'm not sure offhand what
> to do about it.  If we go over to using the query's collation then we
> avoid that issue, but instead we have the problem noted in this comment
> about the histogram sort order not matching what the operator expects.
> (In the case of mcv_selectivity the sort order isn't really an issue,
> but it is an issue for sibling functions such as
> ineq_histogram_selectivity.)
>
> This issue only dates back to commit 5e0928005; before that, we just
> blindly passed DEFAULT_COLLATION_OID to operators being evaluated for
> estimation purposes.  (I suppose if you made the database's default
> collation nondeterministic, you could still get into trouble; but that
> case may not be reachable right now.)  On the other hand, the actual
> breakage is even newer, because nondeterministic collations weren't
> added until 5e1963fb7, several months later.  Both of those are v12
> cycle, so it's academic from a user's standpoint which one we blame;
> but the upshot is that this case doesn't work.
>
> Ideally, no operator would ever throw an error about unsupported
> collations, but I suppose that day is far away.
>
> I guess the path of least resistance is to change the selectivity
> functions to use the query's collation; then, if you get an error
> here you would have done so at runtime anyway.  The problem of
> inconsistency with the histogram collation will be real for
> ineq_histogram_selectivity; but we had a variant of that before,
> in that always using DEFAULT_COLLATION_OID would give answers
> that were wrong for a query using a different collation.
>
> Peter, any other thoughts?
>
>                         regards, tom lane



pgsql-bugs by date:

Previous
From: Joe Conway
Date:
Subject: Re: BUG #16466: Valgrind detects an invalid read in dblink_open()with a cursor inside a transaction
Next
From: Tom Lane
Date:
Subject: Re: Explicit deterministic COLLATE fails with pattern matching operations on column with non-deterministic collation