Thread: Hash join not finding which collation to use for string hashing

Hash join not finding which collation to use for string hashing

From
Mark Dilger
Date:
While reviewing the partition-wise join patch, I ran into an issue that exists in master, so rather than responding to
thatpatch, I’m starting this new thread. 

I noticed that this seems similar to the problem that was supposed to have been fixed in the "Re: COLLATE: Hash
partitionvs UPDATE” thread.  As such, I’ve included Tom and Amit in the CC list. 

Notice the "ERROR:  could not determine which collation to use for string hashing”

The following is extracted from the output from the test:

> CREATE TABLE raw_data (a text);
> INSERT INTO raw_data (a) VALUES ('Türkiye'),
>                                 ('TÜRKIYE'),
>                                 ('bıt'),
>                                 ('BIT'),
>                                 ('äbç'),
>                                 ('ÄBÇ'),
>                                 ('aaá'),
>                                 ('coté'),
>                                 ('Götz'),
>                                 ('ὀδυσσεύς'),
>                                 ('ὈΔΥΣΣΕΎΣ'),
>                                 ('を読み取り用'),
>                                 ('にオープンできませんでした');
> -- Create unpartitioned tables for test
> CREATE TABLE alpha (a TEXT COLLATE "ja_JP", b TEXT COLLATE "sv_SE");
> CREATE TABLE beta (a TEXT COLLATE "tr_TR", b TEXT COLLATE "en_US");
> INSERT INTO alpha (SELECT a, a FROM raw_data);
> INSERT INTO beta (SELECT a, a FROM raw_data);
> ANALYZE alpha;
> ANALYZE beta;
> EXPLAIN (COSTS OFF)
> SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
>                          QUERY PLAN
> ------------------------------------------------------------
>  Hash Join
>    Hash Cond: ((t2.a)::text = (t1.a)::text)
>    ->  Seq Scan on beta t2
>    ->  Hash
>          ->  Seq Scan on alpha t1
>                Filter: (a = ANY ('{äbç,ὀδυσσεύς}'::text[]))
> (6 rows)
>
> SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
> ERROR:  could not determine which collation to use for string hashing
> HINT:  Use the COLLATE clause to set the collation explicitly.
>


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Attachment

Re: Hash join not finding which collation to use for string hashing

From
Tom Lane
Date:
Mark Dilger <mark.dilger@enterprisedb.com> writes:
> While reviewing the partition-wise join patch, I ran into an issue that exists in master, so rather than responding
tothat patch, I’m starting this new thread. 
> I noticed that this seems similar to the problem that was supposed to have been fixed in the "Re: COLLATE: Hash
partitionvs UPDATE” thread.  As such, I’ve included Tom and Amit in the CC list. 

Hm, I don't see any bug here.  You're asking it to join

>> CREATE TABLE alpha (a TEXT COLLATE "ja_JP", b TEXT COLLATE "sv_SE");
>> CREATE TABLE beta (a TEXT COLLATE "tr_TR", b TEXT COLLATE "en_US");

>> SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς');

so t1.a and t2.a have different collations, and the system can't resolve
which to use for the comparison.

Now, I'd be the first to agree that this error could be reported better.
The parser knows that it couldn't resolve a collation for t1.a = t2.a, but
what it does *not* know is whether the '=' operator cares for collation.
Throwing an error when the operator wouldn't care at runtime isn't going
to make many people happy.  On the other hand, when the operator finally
does run and can't get a collation, all it knows is that it didn't get a
collation, not why.  So we can't produce an error message as specific as
"ja_JP and tr_TR collations conflict".

Now that the collations feature has settled in, it'd be nice to go back
and see if we can't improve that somehow.  Not sure how.

(BTW, before v12 the text '=' operator indeed did not care for collation,
so this example would've worked.  But the change in behavior is a
necessary consequence of having invented nondeterministic collations,
not a bug.)

            regards, tom lane



Re: Hash join not finding which collation to use for string hashing

From
Mark Dilger
Date:

> On Jan 28, 2020, at 6:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Mark Dilger <mark.dilger@enterprisedb.com> writes:
>> While reviewing the partition-wise join patch, I ran into an issue that exists in master, so rather than responding
tothat patch, I’m starting this new thread. 
>> I noticed that this seems similar to the problem that was supposed to have been fixed in the "Re: COLLATE: Hash
partitionvs UPDATE” thread.  As such, I’ve included Tom and Amit in the CC list. 
>
> Hm, I don't see any bug here.  You're asking it to join
>
>>> CREATE TABLE alpha (a TEXT COLLATE "ja_JP", b TEXT COLLATE "sv_SE");
>>> CREATE TABLE beta (a TEXT COLLATE "tr_TR", b TEXT COLLATE "en_US");
>
>>> SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
>
> so t1.a and t2.a have different collations, and the system can't resolve
> which to use for the comparison.
>
> Now, I'd be the first to agree that this error could be reported better.
> The parser knows that it couldn't resolve a collation for t1.a = t2.a, but
> what it does *not* know is whether the '=' operator cares for collation.
> Throwing an error when the operator wouldn't care at runtime isn't going
> to make many people happy.  On the other hand, when the operator finally
> does run and can't get a collation, all it knows is that it didn't get a
> collation, not why.  So we can't produce an error message as specific as
> "ja_JP and tr_TR collations conflict".
>
> Now that the collations feature has settled in, it'd be nice to go back
> and see if we can't improve that somehow.  Not sure how.
>
> (BTW, before v12 the text '=' operator indeed did not care for collation,
> so this example would've worked.  But the change in behavior is a
> necessary consequence of having invented nondeterministic collations,
> not a bug.)

I contemplated that for a while before submitting the report.  I agree that for strings that are not binary equal, some
collationsmight say the two strings are equal, and other collations may say that they are not.  But when does any
collationsay that a string is not equal to itself?  All the strings in these columns were loaded from the same source
table,and they should always equal themselves, so the only problem I am aware of is if some of them equal others of
themunder one of the collations in question, where the other collation doesn’t think so.  I’m pretty sure that does not
existin this concrete example. 

I guess I’m arguing that the system is giving up too soon, saying, “In theory there might be values I don’t know how to
compare,so I’m going to give up now and not look”. 

I think what is happening here is that the system thinks, “Hey, I can use a hash join for this”, and then later
realizes,“Oh, no, I can’t” and instead of falling back to something other than hash join, it gives up. 

Is there some more fundamental reason this query couldn’t correctly be completed?  I don’t mind being enlightened about
thepart that I’m missing. 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Hash join not finding which collation to use for string hashing

From
Mark Dilger
Date:

> On Jan 28, 2020, at 7:38 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
>
>
>> On Jan 28, 2020, at 6:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> Mark Dilger <mark.dilger@enterprisedb.com> writes:
>>> While reviewing the partition-wise join patch, I ran into an issue that exists in master, so rather than responding
tothat patch, I’m starting this new thread. 
>>> I noticed that this seems similar to the problem that was supposed to have been fixed in the "Re: COLLATE: Hash
partitionvs UPDATE” thread.  As such, I’ve included Tom and Amit in the CC list. 
>>
>> Hm, I don't see any bug here.  You're asking it to join
>>
>>>> CREATE TABLE alpha (a TEXT COLLATE "ja_JP", b TEXT COLLATE "sv_SE");
>>>> CREATE TABLE beta (a TEXT COLLATE "tr_TR", b TEXT COLLATE "en_US");
>>
>>>> SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
>>
>> so t1.a and t2.a have different collations, and the system can't resolve
>> which to use for the comparison.
>>
>> Now, I'd be the first to agree that this error could be reported better.
>> The parser knows that it couldn't resolve a collation for t1.a = t2.a, but
>> what it does *not* know is whether the '=' operator cares for collation.
>> Throwing an error when the operator wouldn't care at runtime isn't going
>> to make many people happy.  On the other hand, when the operator finally
>> does run and can't get a collation, all it knows is that it didn't get a
>> collation, not why.  So we can't produce an error message as specific as
>> "ja_JP and tr_TR collations conflict".
>>
>> Now that the collations feature has settled in, it'd be nice to go back
>> and see if we can't improve that somehow.  Not sure how.
>>
>> (BTW, before v12 the text '=' operator indeed did not care for collation,
>> so this example would've worked.  But the change in behavior is a
>> necessary consequence of having invented nondeterministic collations,
>> not a bug.)
>
> I contemplated that for a while before submitting the report.  I agree that for strings that are not binary equal,
somecollations might say the two strings are equal, and other collations may say that they are not. But when does any
collationsay that a string is not equal to itself?  All the strings in these columns were loaded from the same source
table,and they should always equal themselves, so the only problem I am aware of is if some of them equal others of
themunder one of the collations in question, where the other collation doesn’t think so.  I’m pretty sure that does not
existin this concrete example. 
>
> I guess I’m arguing that the system is giving up too soon, saying, “In theory there might be values I don’t know how
tocompare, so I’m going to give up now and not look”. 
>
> I think what is happening here is that the system thinks, “Hey, I can use a hash join for this”, and then later
realizes,“Oh, no, I can’t” and instead of falling back to something other than hash join, it gives up. 
>
> Is there some more fundamental reason this query couldn’t correctly be completed?  I don’t mind being enlightened
aboutthe part that I’m missing. 

If the answer here is just that you’d rather it always fail at planning time because that’s more deterministic than
havingit sometimes succeed and sometimes fail at runtime depending on which data has been loaded, ok, I can understand
that. If so, then let’s put this error string into the docs, because right now, if you google 

    site:postgresql.org "could not determine which collation to use for string hashing”

you don’t get anything from the docs telling you that this is an expected outcome.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Hash join not finding which collation to use for string hashing

From
Amit Langote
Date:
Hi Mark,

On Wed, Jan 29, 2020 at 1:03 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> > On Jan 28, 2020, at 7:38 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
> >> Mark Dilger <mark.dilger@enterprisedb.com> writes:
> >>> While reviewing the partition-wise join patch, I ran into an issue that exists in master, so rather than
respondingto that patch, I’m starting this new thread. 
> >>> I noticed that this seems similar to the problem that was supposed to have been fixed in the "Re: COLLATE: Hash
partitionvs UPDATE” thread.  As such, I’ve included Tom and Amit in the CC list. 

Just to clarify, we only intended in the quoted thread to plug
relevant holes of the *partitioning* code, which IIRC was more
straightforward to do than appears to be the case here.

> If the answer here is just that you’d rather it always fail at planning time because that’s more deterministic than
havingit sometimes succeed and sometimes fail at runtime depending on which data has been loaded, ok, I can understand
that. If so, then let’s put this error string into the docs, because right now, if you google 
>
>         site:postgresql.org "could not determine which collation to use for string hashing”
>
> you don’t get anything from the docs telling you that this is an expected outcome.

You may have noticed that it's not only hash join that bails out:

EXPLAIN (COSTS OFF) SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2
ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
                         QUERY PLAN
------------------------------------------------------------
 Hash Join
   Hash Cond: ((t2.a)::text = (t1.a)::text)
   ->  Seq Scan on beta t2
   ->  Hash
         ->  Seq Scan on alpha t1
               Filter: (a = ANY ('{äbç,ὀδυσσεύς}'::text[]))
(6 rows)

SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a)
WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
ERROR:  could not determine which collation to use for string hashing
HINT:  Use the COLLATE clause to set the collation explicitly.

SET enable_hashjoin TO off;
-- error occurs partway through ExecInitMergeJoin(), so EXPLAIN can't finish
EXPLAIN (COSTS OFF) SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2
ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
ERROR:  could not determine which collation to use for string comparison
HINT:  Use the COLLATE clause to set the collation explicitly.

SET enable_mergejoin TO off;
EXPLAIN (COSTS OFF) SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2
ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
                         QUERY PLAN
------------------------------------------------------------
 Nested Loop
   Join Filter: ((t1.a)::text = (t2.a)::text)
   ->  Seq Scan on beta t2
   ->  Materialize
         ->  Seq Scan on alpha t1
               Filter: (a = ANY ('{äbç,ὀδυσσεύς}'::text[]))
(6 rows)

SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a)
WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
ERROR:  could not determine which collation to use for string comparison
HINT:  Use the COLLATE clause to set the collation explicitly.

With PG 11, I can see that hash join and nestloop join work.  But with
PG 12, this join can't possible work without an explicit COLLATE
clause.  So it would be nice if we can report a more specific error
much sooner, possibly with some parser context, given that we now know
for sure that a join qual without a collation assigned will not work
at all.  IOW, maybe we should aim for making the run-time collation
errors to be of "won't happen" category as much as possible.

Tom said:
> >> Now, I'd be the first to agree that this error could be reported better.
> >> The parser knows that it couldn't resolve a collation for t1.a = t2.a, but
> >> what it does *not* know is whether the '=' operator cares for collation.
> >> Throwing an error when the operator wouldn't care at runtime isn't going
> >> to make many people happy.  On the other hand, when the operator finally
> >> does run and can't get a collation, all it knows is that it didn't get a
> >> collation, not why.  So we can't produce an error message as specific as
> >> "ja_JP and tr_TR collations conflict".
> >>
> >> Now that the collations feature has settled in, it'd be nice to go back
> >> and see if we can't improve that somehow.  Not sure how.

Would it make sense to catch a qual with unassigned collation
somewhere in the planner, where the qual's operator family is
estatblished, by checking if the operator family behavior is sensitive
to collations?

Thanks,
Amit



Re: Hash join not finding which collation to use for string hashing

From
Mark Dilger
Date:

> On Jan 29, 2020, at 10:14 PM, Amit Langote <amitlangote09@gmail.com> wrote:
>
>
> SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a)
> WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
> ERROR:  could not determine which collation to use for string comparison
> HINT:  Use the COLLATE clause to set the collation explicitly.
>
> With PG 11, I can see that hash join and nestloop join work.  But with
> PG 12, this join can't possible work without an explicit COLLATE
> clause.  So it would be nice if we can report a more specific error
> much sooner, possibly with some parser context, given that we now know
> for sure that a join qual without a collation assigned will not work
> at all.  IOW, maybe we should aim for making the run-time collation
> errors to be of "won't happen" category as much as possible.
>
> Tom said:
>>>> Now, I'd be the first to agree that this error could be reported better.
>>>> The parser knows that it couldn't resolve a collation for t1.a = t2.a, but
>>>> what it does *not* know is whether the '=' operator cares for collation.
>>>> Throwing an error when the operator wouldn't care at runtime isn't going
>>>> to make many people happy.  On the other hand, when the operator finally
>>>> does run and can't get a collation, all it knows is that it didn't get a
>>>> collation, not why.  So we can't produce an error message as specific as
>>>> "ja_JP and tr_TR collations conflict".
>>>>
>>>> Now that the collations feature has settled in, it'd be nice to go back
>>>> and see if we can't improve that somehow.  Not sure how.
>
> Would it make sense to catch a qual with unassigned collation
> somewhere in the planner, where the qual's operator family is
> estatblished, by checking if the operator family behavior is sensitive
> to collations?

Hi Amit, I appreciate your attention to my question, but I’m not ready to delve into possible fixes, as I still don’t
entirelyunderstand the problem. 

According to Tom:

> (BTW, before v12 the text '=' operator indeed did not care for collation,
> so this example would've worked.  But the change in behavior is a
> necessary consequence of having invented nondeterministic collations,
> not a bug.)

I’m still struggling with that, because the four collations I used in the example are all deterministic.  I totally
understandwhy having more than one collation matters if you ask that your data be in sorted order, as the system needs
toknow which ordering to use.  But for equality, I would think that deterministic collations are all interchangeable,
becausethey all agree on whether A = B, regardless of the collation defined on column A and/or on column B.  Maybe I’m
wrongabout that.  But that’s my reading of the definition of “deterministic collation” given in the docs: 

> A deterministic collation uses deterministic comparisons, which means that it considers strings to be equal only if
theyconsist of the same byte sequence.  

I’m reading that as “If and only if”, and maybe I’m wrong to do so.  Maybe that’s my error.  But assuming that part is
ok,it would seem to be sufficient to know that the columns being joined use deterministic collations, and you wouldn’t
needthem to be the *same* collations, nor even remember which collations they were.  You’d just need information passed
downthat collations can be ignored for this comparison, or that a built-in byte-for-byte equality comparator should be
usedrather than the collation’s equality comparator, or some such solution. 

I’m guessing I’m wrong about at least one of these things, and I’m hoping somebody enlightens me.

Thanks so much in advance,


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Hash join not finding which collation to use for string hashing

From
Mark Dilger
Date:

>> Would it make sense to catch a qual with unassigned collation
>> somewhere in the planner, where the qual's operator family is
>> estatblished, by checking if the operator family behavior is sensitive
>> to collations?

I’m not sure how to do that.  pg_opfamily doesn’t seem to have a field for that.  Can you recommend how I would proceed
there?



There may be operators other than = and != that are worth thinking about, but for now I’m really only thinking about
thosetwo.  It is hard for me to see how we could ignore a collation mismatch for any other operator, but maybe I’m just
notthinking about it the right way. 

So, for = and !=, I’m looking at the definition of texteq, and it calls check_collation_set as one of the very first
thingsit does.  That’s where the error that annoys me comes out.  But I don’t think it really needs to be doing this.
Itshould first be determining if collation *matters*.  It can’t do that right now, because it gets that information
fromthis line: 

>     if (lc_collate_is_c(collid) ||
>         collid == DEFAULT_COLLATION_OID ||
>         pg_newlocale_from_collation(collid)->deterministic)

Which obviously won’t work if collid hasn’t been set.  So three approaches come to my mind:

1) Somewhere upstream from calling texteq, figure out that we don’t actually care about the collation stuff, because
boththe left and right side of the comparison use deterministic collations and the comparison we’re calling is
equality,and then pass down a dummy collation such as “C” even though that isn’t actually true. 

The problem with (1) that I see is that it could lead to the wrong collation being mentioned in error messages, though
Ihaven’t looked, and that it’s enough of a hack that it might make coding in this area harder in the future. 

2) Somewhere upstream from calling texteq, pass in a new boolean flag that specifies whether collation matters, and
extendtexteq to take an additional argument. 

This also seems very hacky to me, but for different reasons.

3) Extend the concept of collations to collation sets.  Right now, I’m only thinking about a collation set as having
twovalues, the lefthand and the righthand side, but maybe there are other cases like (Left, (Left,Right)) that get
builtup and need to work.  Anyway, at the point in the executor that the collations don’t match, instead of passing
NULLdown the line, pass in a collation set (Left, Right), and functions like texteq can see that they’re dealing with
twodifferent collations and decide if they can deal with that or if they need to throw an error. 

I bet if we went with (3), the error being thrown in the example I used to start this thread would go away, without
breakinganything else.  I’m going to go poke at that a bit, but I’d still appreciate any comments/concerns about my
analysis.

> According to Tom:
>
>> (BTW, before v12 the text '=' operator indeed did not care for collation,
>> so this example would've worked.  But the change in behavior is a
>> necessary consequence of having invented nondeterministic collations,
>> not a bug.)
>
> I’m still struggling with that, because the four collations I used in the example are all deterministic.  I totally
understandwhy having more than one collation matters if you ask that your data be in sorted order, as the system needs
toknow which ordering to use.  But for equality, I would think that deterministic collations are all interchangeable,
becausethey all agree on whether A = B, regardless of the collation defined on column A and/or on column B.  Maybe I’m
wrongabout that.  But that’s my reading of the definition of “deterministic collation” given in the docs: 
>
>> A deterministic collation uses deterministic comparisons, which means that it considers strings to be equal only if
theyconsist of the same byte sequence.  
>
> I’m reading that as “If and only if”, and maybe I’m wrong to do so.  Maybe that’s my error.  But assuming that part
isok, it would seem to be sufficient to know that the columns being joined use deterministic collations, and you
wouldn’tneed them to be the *same* collations, nor even remember which collations they were.  You’d just need
informationpassed down that collations can be ignored for this comparison, or that a built-in byte-for-byte equality
comparatorshould be used rather than the collation’s equality comparator, or some such solution. 

I’m starting to think that “consequence of having invented nondeterministic collations” in Tom’s message really should
read“consequence of having invented nondeterministic collations without reworking these other interfaces”, but once
again,I’m hoping to be corrected if I’ve gone off in the wrong direction here. 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Hash join not finding which collation to use for string hashing

From
Tom Lane
Date:
Mark Dilger <mark.dilger@enterprisedb.com> writes:
> According to Tom:
>> (BTW, before v12 the text '=' operator indeed did not care for collation,
>> so this example would've worked.  But the change in behavior is a
>> necessary consequence of having invented nondeterministic collations,
>> not a bug.)

> I’m still struggling with that, because the four collations I used in
> the example are all deterministic.  I totally understand why having more
> than one collation matters if you ask that your data be in sorted order,
> as the system needs to know which ordering to use.  But for equality, I
> would think that deterministic collations are all interchangeable,
> because they all agree on whether A = B, regardless of the collation
> defined on column A and/or on column B.  Maybe I’m wrong about that.

Well, you're not wrong, but you're assuming much finer distinctions
than the collation machinery actually makes (or than it'd be sane
to ask it to make, IMO).  We don't have a way to tell texteq that
"well, we don't know what collation to assign to this operation,
but it's okay to assume that it's deterministic".  Nor does the
parser have any way to know that texteq could be satisfied by
that knowledge --- if it doesn't even know whether texteq cares
about collation, how could it know that?

There are other issues here too.  Just because the query could
theoretically be implemented without reference to any specific
collation doesn't mean that that's a practical thing to do.
It'd be unclear for instance whether we can safely use indexes
that *do* have specific collations attached.  We'd also lose
the option to consider plans like mergejoins.

If the parser understood that a particular operator behaved
like text equality --- which it does not, and I guarantee you
I will shoot down any proposal to hard-wire a parser test for
that particular operator --- you could imagine assigning "C"
collation when we have an unresolvable combination of
deterministic collations for the inputs.  That dodges the
problem of not having any way to represent the situation.
But it's still got implementation issues, in that such a
collation choice probably won't match the collations of any
indexes for the input columns.

Another issue is that collations "bubble up" in the parse tree,
so sneaking in a collation that's not supposed to be there per
spec carries a risk of causing unexpected semantics further up.
I think we could get away with that for the particular case of
equality (which returns collation-less boolean), but this is
another thing that makes the case narrower and less useful.

In the end, TBH, I'm not finding your example compelling enough
to be worth putting in weird hacks for such cases.  If you're
joining columns of dissimilar collations, you're going to be
finding it necessary to specify what collation to use in a lot
of places ... so where's the value in making a weird special
case for equality?

            regards, tom lane



Re: Hash join not finding which collation to use for string hashing

From
Robert Haas
Date:
On Thu, Jan 30, 2020 at 2:44 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> 3) Extend the concept of collations to collation sets.  Right now, I’m only thinking about a collation set as having
twovalues, the lefthand and the righthand side, but maybe there are other cases like (Left, (Left,Right)) that get
builtup and need to work.  Anyway, at the point in the executor that the collations don’t match, instead of passing
NULLdown the line, pass in a collation set (Left, Right), and functions like texteq can see that they’re dealing with
twodifferent collations and decide if they can deal with that or if they need to throw an error. 
>
> I bet if we went with (3), the error being thrown in the example I used to start this thread would go away, without
breakinganything else.  I’m going to go poke at that a bit, but I’d still appreciate any comments/concerns about my
analysis.

I assume that what would have to happen to implement this is that an
SQL-callable function would be passed more than one collation OID,
perhaps one per argument or something like that. Notice, however, that
this would require changing the way that functions get called. See the
DirectFunctionCall{1,2,3,...}Coll() and
FunctionCall{0,1,2,3,...}Coll() and the definition of
FunctionCallInfoBaseData -- there's only one spot for an OID available
right now. Allowing for more would likely have a noticeable impact on
the cost of calling SQL-callable functions, and that's already
expensive enough that people have been unhappy about it. It seems
unlikely that it would be worth incurring more overhead here for every
query all the time just to make this case work.

It is, perhaps, a little strange that the only two choices for an
operator are "cares about collation" and "doesn't," and I somehow feel
like there ought to be a way to do better. But I don't know what it
is.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Hash join not finding which collation to use for string hashing

From
Mark Dilger
Date:

> On Jan 30, 2020, at 12:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Mark Dilger <mark.dilger@enterprisedb.com> writes:
>> According to Tom:
>>> (BTW, before v12 the text '=' operator indeed did not care for collation,
>>> so this example would've worked.  But the change in behavior is a
>>> necessary consequence of having invented nondeterministic collations,
>>> not a bug.)
>
>> I’m still struggling with that, because the four collations I used in
>> the example are all deterministic.  I totally understand why having more
>> than one collation matters if you ask that your data be in sorted order,
>> as the system needs to know which ordering to use.  But for equality, I
>> would think that deterministic collations are all interchangeable,
>> because they all agree on whether A = B, regardless of the collation
>> defined on column A and/or on column B.  Maybe I’m wrong about that.
>
> Well, you're not wrong, but you're assuming much finer distinctions
> than the collation machinery actually makes (or than it'd be sane
> to ask it to make, IMO).  We don't have a way to tell texteq that
> "well, we don't know what collation to assign to this operation,
> but it's okay to assume that it's deterministic".  Nor does the
> parser have any way to know that texteq could be satisfied by
> that knowledge --- if it doesn't even know whether texteq cares
> about collation, how could it know that?

I agree.  Having this in the parser seems really weird and unwholesome.

> There are other issues here too.  Just because the query could
> theoretically be implemented without reference to any specific
> collation doesn't mean that that's a practical thing to do.
> It'd be unclear for instance whether we can safely use indexes
> that *do* have specific collations attached.  We'd also lose
> the option to consider plans like mergejoins.
>
> If the parser understood that a particular operator behaved
> like text equality --- which it does not, and I guarantee you
> I will shoot down any proposal to hard-wire a parser test for
> that particular operator --- you could imagine assigning "C"
> collation when we have an unresolvable combination of
> deterministic collations for the inputs.  That dodges the
> problem of not having any way to represent the situation.
> But it's still got implementation issues, in that such a
> collation choice probably won't match the collations of any
> indexes for the input columns.

Yeah, I disclaimed that idea in a subsequent email, but if you’re responding to my emails in the order that you receive
them(which is totally reasonable), then you aren’t to know that yet. 

>
> Another issue is that collations "bubble up" in the parse tree,
> so sneaking in a collation that's not supposed to be there per
> spec carries a risk of causing unexpected semantics further up.
> I think we could get away with that for the particular case of
> equality (which returns collation-less boolean), but this is
> another thing that makes the case narrower and less useful.

I was wondering if bubbling up (LeftCollation,RightCollation) would be ok.  There are likely cases that can’t make use
ofthat, but those places would just throw the same sort of error that they’re currently throwing, except they’d have a
moreuseful error message because it would include which collations were mismatched. 

> In the end, TBH, I'm not finding your example compelling enough
> to be worth putting in weird hacks for such cases.  If you're
> joining columns of dissimilar collations, you're going to be
> finding it necessary to specify what collation to use in a lot
> of places ... so where's the value in making a weird special
> case for equality?

I agree with your position against weird hacks.  If the only way to do this is a weird hack, then forget about it.

If I’m not putting upon your time too much, could you respond to my other email in this thread as to whether it sounds
anybetter? 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Hash join not finding which collation to use for string hashing

From
Tom Lane
Date:
Mark Dilger <mark.dilger@enterprisedb.com> writes:
> Would it make sense to catch a qual with unassigned collation
> somewhere in the planner, where the qual's operator family is
> estatblished, by checking if the operator family behavior is sensitive
> to collations?

> I’m not sure how to do that.  pg_opfamily doesn’t seem to have a field for that.  Can you recommend how I would
proceedthere? 

There's no such information attached to opfamilies, which is more or less
forced by the fact that individual operators don't expose it either.
There's not much hope of making that better without incompatible changes
in the requirements for extensions to define operators and/or operator
families.

> So, for = and !=, I’m looking at the definition of texteq, and it calls check_collation_set as one of the very first
thingsit does.  That’s where the error that annoys me comes out.  But I don’t think it really needs to be doing this.
Itshould first be determining if collation *matters*. 

But of course it matters.  How do you know whether the operation is
deterministic if you don't know the collation?

> 3) Extend the concept of collations to collation sets.  Right now, I’m only thinking about a collation set as having
twovalues, the lefthand and the righthand side, but maybe there are other cases like (Left, (Left,Right)) that get
builtup and need to work.  Anyway, at the point in the executor that the collations don’t match, instead of passing
NULLdown the line, pass in a collation set (Left, Right), and functions like texteq can see that they’re dealing with
twodifferent collations and decide if they can deal with that or if they need to throw an error. 

Maybe this could work.  I think it would get messy when bubbling up
collations, but as long as you're talking about "sets" not "pairs"
it might be possible to postpone collation resolution.

To me, though, the main advantage of this is that we could throw a
more explicit error like "collations "ja_JP" and "tr_TR" cannot be
unified", since that information would still be there at runtime.
I'm still pretty dubious that having texteq special-case the situation
where the collations are different but all deterministic is a reasonable
thing to do.

One practical problem is that postponing that work to runtime could be
a huge performance hit, because you'd have to do it over again on each
call of the operator.  I suppose some caching might be possible.

Another issue is that you're still putting far too much emphasis on
the fact that a hash-join plan manages to avoid this error, and ignoring
the problem that a lot of other plans for the same query will not avoid
it.  What if the planner had chosen a merge-join, for instance?  How
useful is it to allow the join if things still break the moment you
add an ORDER BY?

            regards, tom lane



Re: Hash join not finding which collation to use for string hashing

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I assume that what would have to happen to implement this is that an
> SQL-callable function would be passed more than one collation OID,
> perhaps one per argument or something like that. Notice, however, that
> this would require changing the way that functions get called. See the
> DirectFunctionCall{1,2,3,...}Coll() and
> FunctionCall{0,1,2,3,...}Coll() and the definition of
> FunctionCallInfoBaseData -- there's only one spot for an OID available
> right now. Allowing for more would likely have a noticeable impact on
> the cost of calling SQL-callable functions, and that's already
> expensive enough that people have been unhappy about it. It seems
> unlikely that it would be worth incurring more overhead here for every
> query all the time just to make this case work.

The implementation I was visualizing was replacing, eg,
FuncExpr.inputcollid with an OID List, and then teaching PG_GET_COLLATION
to throw an error if the list is longer than one element.  I agree that
the performance implications of that would be pretty troublesome, though.

In the end, it seems like the only solution that would be remotely
practical from a performance standpoint is to redefine things so that
collation-sensitive functions have to be labeled as such in pg_proc,
and then we can have the parser throw the appropriate error if it
can't resolve an input collation for such a function.  Perhaps the
backwards-compatibility hit wouldn't be as bad as it first seems,
since the whole thing can be ignored for functions that haven't got at
least one collatable input, and most of those would likely be all right
with a default assumption that they are collation sensitive.  Or maybe
better, we could make the default assumption be that they aren't
sensitive, with the same error still being thrown at runtime if they are,
so that extensions have to take positive action to get the better error
behavior but if they don't then things are no worse than today.

Mark, obviously, would then lobby for the pg_proc marking to
include one state that identifies functions that only care about
collation when it's nondeterministic.  But I'm still not very
sure how that would work as soon as you look anyplace except at
what texteq() itself would do.  The questions of whether such a
query matches a given index, or could be implemented via mergejoin,
etc, remain.

            regards, tom lane



Re: Hash join not finding which collation to use for string hashing

From
Mark Dilger
Date:

> On Jan 30, 2020, at 12:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Mark Dilger <mark.dilger@enterprisedb.com> writes:
>> Would it make sense to catch a qual with unassigned collation
>> somewhere in the planner, where the qual's operator family is
>> estatblished, by checking if the operator family behavior is sensitive
>> to collations?
>
>> I’m not sure how to do that.  pg_opfamily doesn’t seem to have a field for that.  Can you recommend how I would
proceedthere? 
>
> There's no such information attached to opfamilies, which is more or less
> forced by the fact that individual operators don't expose it either.
> There's not much hope of making that better without incompatible changes
> in the requirements for extensions to define operators and/or operator
> families.

Thanks, Tom, for confirming this.

Given the excellent explanations you and Robert have given, I think I’m retracting this whole idea and accepting your
positionsthat it’s not worth it. 

For the archives, I’m still going to respond to the rest of what you say:

>> So, for = and !=, I’m looking at the definition of texteq, and it calls check_collation_set as one of the very first
thingsit does.  That’s where the error that annoys me comes out.  But I don’t think it really needs to be doing this.
Itshould first be determining if collation *matters*. 
>
> But of course it matters.  How do you know whether the operation is
> deterministic if you don't know the collation?
>
>> 3) Extend the concept of collations to collation sets.  Right now, I’m only thinking about a collation set as having
twovalues, the lefthand and the righthand side, but maybe there are other cases like (Left, (Left,Right)) that get
builtup and need to work.  Anyway, at the point in the executor that the collations don’t match, instead of passing
NULLdown the line, pass in a collation set (Left, Right), and functions like texteq can see that they’re dealing with
twodifferent collations and decide if they can deal with that or if they need to throw an error. 
>
> Maybe this could work.  I think it would get messy when bubbling up
> collations, but as long as you're talking about "sets" not "pairs"
> it might be possible to postpone collation resolution.
>
> To me, though, the main advantage of this is that we could throw a
> more explicit error like "collations "ja_JP" and "tr_TR" cannot be
> unified", since that information would still be there at runtime.
> I'm still pretty dubious that having texteq special-case the situation
> where the collations are different but all deterministic is a reasonable
> thing to do.

On my mac, when I run “SELECT * FROM pg_collation”, every one of the 271 rows I get back have collisdeterministic true.
Iknow that which collations you get on a system is variable, so I’m not saying that nobody has nondeterministic
collations,but it seems common enough that mismatched collations will both be deterministic.  That’s the common case,
notsome weird edge case. 

So the issue here seems to be whether equality should get different treatment from other operators, and I obviously am
arguingthat it should, though you and Robert have both made really good points against that position. 

>
> One practical problem is that postponing that work to runtime could be
> a huge performance hit, because you'd have to do it over again on each
> call of the operator.  I suppose some caching might be possible.

Yes, Robert mentioned performance implications, too.

> Another issue is that you're still putting far too much emphasis on
> the fact that a hash-join plan manages to avoid this error, and ignoring
> the problem that a lot of other plans for the same query will not avoid
> it.  What if the planner had chosen a merge-join, for instance?

You’re looking at the problem from the point of view of how postgres is currently and historically implemented, and
seeingthat this problem is hard.  I was looking at it more from the perspective of a user who gets the error message
andthinks, “this is stupid, the query is refusing to run for want of a collation being specified, but I can clearly see
thatit doesn’t actually need one.”  I think that same reaction from the user would happen if the planner chose a
merge-join. The user would just say, “gee, what a stupid planner, why did it choose a merge join for this when the lack
ofa collation clause clearly indicates that it should have limited itself to something that only needs equality
comparison.”

I’m not calling the planner stupid, nor the system generally, but I know that people get frustrated with systems that
haveunintuitive limitations like this when they don’t know the internals of the system that give rise to the
limitations.

> How
> useful is it to allow the join if things still break the moment you
> add an ORDER BY?

I think that’s apples-to-oranges.  If you ask the system to order the data, and you’ve got ambiguity about which
orderingyou mean, then of course it can’t continue until you tell it which collation you want. 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Hash join not finding which collation to use for string hashing

From
Amit Langote
Date:
On Fri, Jan 31, 2020 at 6:15 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> > On Jan 30, 2020, at 12:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Mark Dilger <mark.dilger@enterprisedb.com> writes:
> >> Would it make sense to catch a qual with unassigned collation
> >> somewhere in the planner, where the qual's operator family is
> >> estatblished, by checking if the operator family behavior is sensitive
> >> to collations?
> >
> >> I’m not sure how to do that.  pg_opfamily doesn’t seem to have a field for that.  Can you recommend how I would
proceedthere? 
> >
> > There's no such information attached to opfamilies, which is more or less
> > forced by the fact that individual operators don't expose it either.
> > There's not much hope of making that better without incompatible changes
> > in the requirements for extensions to define operators and/or operator
> > families.
>
> Thanks, Tom, for confirming this.

Just for the record, I will explain why I brought up doing anything
with operator families at all.  What I was really imagining is putting
a hard-coded check somewhere in the middle of equivalence processing
to see if a given qual's operator would be sensitive to collation
based *only* on whether it belongs to a text operator family, such as
TEXT_BTREE_FAM_OID, whereas the qual expression's inputcollid is 0
(parser failed to resolve collation conflict among its arguments) and
erroring out if so.  If we do that, maybe we won't need
check_collation_set() that's used in various text operators.  Also,
erroring out sooner might allow us to produce more specific error
message, which as I understand it, would help with one of the Mark's
complaints that the error message is too ambiguous due to emitted as
such a low layer.

I thought of the idea after running into a recent commit relevant to
non-deterministic collations:

commit 2810396312664bdb941e549df7dfa75218d73a1c
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Sat Sep 21 16:29:17 2019 -0400

    Fix up handling of nondeterministic collations with pattern_ops opclasses.

    text_pattern_ops and its siblings can't be used with nondeterministic
    collations, because they use the text_eq operator which will not behave
    as bitwise equality if applied with a nondeterministic collation.  The
    initial implementation of that restriction was to insert a run-time test
    in the related comparison functions, but that is inefficient, may throw
    misleading errors, and will throw errors in some cases that would work.
    It seems sufficient to just prevent the combination during CREATE INDEX,
    so do that instead.

    Lacking any better way to identify the opclasses involved, we need to
    hard-wire tests for them, which requires hand-assigned values for their
    OIDs, which forces a catversion bump because they previously had OIDs
    that would be assigned automatically.  That's slightly annoying in the
    v12 branch, but fortunately we're not at rc1 yet, so just do it.

    Discussion: https://postgr.es/m/22566.1568675619@sss.pgh.pa.us

IIUC, the above commit removes a check_collation_set() call from a
operator class comparison function in favor of ensuring that an index
using that operator class can only be defined with a deterministic
collation in the first place.  But as the above commit is purportedly
only a stop-gap solution due to lacking operator class infrastructure
to consider collation in operator semantics, maybe we shouldn't spread
such a hack in other places.

Thanks,
Amit