Re: Hash join not finding which collation to use for string hashing - Mailing list pgsql-hackers

From Mark Dilger
Subject Re: Hash join not finding which collation to use for string hashing
Date
Msg-id 9F96290F-13D8-4B97-ABCD-FC7DC1578EAF@enterprisedb.com
Whole thread Raw
In response to Re: Hash join not finding which collation to use for string hashing  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Hash join not finding which collation to use for string hashing  (Mark Dilger <mark.dilger@enterprisedb.com>)
List pgsql-hackers

> On Jan 28, 2020, at 6:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Mark Dilger <mark.dilger@enterprisedb.com> writes:
>> While reviewing the partition-wise join patch, I ran into an issue that exists in master, so rather than responding
tothat patch, I’m starting this new thread. 
>> I noticed that this seems similar to the problem that was supposed to have been fixed in the "Re: COLLATE: Hash
partitionvs UPDATE” thread.  As such, I’ve included Tom and Amit in the CC list. 
>
> Hm, I don't see any bug here.  You're asking it to join
>
>>> CREATE TABLE alpha (a TEXT COLLATE "ja_JP", b TEXT COLLATE "sv_SE");
>>> CREATE TABLE beta (a TEXT COLLATE "tr_TR", b TEXT COLLATE "en_US");
>
>>> SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
>
> so t1.a and t2.a have different collations, and the system can't resolve
> which to use for the comparison.
>
> Now, I'd be the first to agree that this error could be reported better.
> The parser knows that it couldn't resolve a collation for t1.a = t2.a, but
> what it does *not* know is whether the '=' operator cares for collation.
> Throwing an error when the operator wouldn't care at runtime isn't going
> to make many people happy.  On the other hand, when the operator finally
> does run and can't get a collation, all it knows is that it didn't get a
> collation, not why.  So we can't produce an error message as specific as
> "ja_JP and tr_TR collations conflict".
>
> Now that the collations feature has settled in, it'd be nice to go back
> and see if we can't improve that somehow.  Not sure how.
>
> (BTW, before v12 the text '=' operator indeed did not care for collation,
> so this example would've worked.  But the change in behavior is a
> necessary consequence of having invented nondeterministic collations,
> not a bug.)

I contemplated that for a while before submitting the report.  I agree that for strings that are not binary equal, some
collationsmight say the two strings are equal, and other collations may say that they are not.  But when does any
collationsay that a string is not equal to itself?  All the strings in these columns were loaded from the same source
table,and they should always equal themselves, so the only problem I am aware of is if some of them equal others of
themunder one of the collations in question, where the other collation doesn’t think so.  I’m pretty sure that does not
existin this concrete example. 

I guess I’m arguing that the system is giving up too soon, saying, “In theory there might be values I don’t know how to
compare,so I’m going to give up now and not look”. 

I think what is happening here is that the system thinks, “Hey, I can use a hash join for this”, and then later
realizes,“Oh, no, I can’t” and instead of falling back to something other than hash join, it gives up. 

Is there some more fundamental reason this query couldn’t correctly be completed?  I don’t mind being enlightened about
thepart that I’m missing. 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






pgsql-hackers by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: Is custom MemoryContext prohibited?
Next
From: Mark Dilger
Date:
Subject: Re: Hash join not finding which collation to use for string hashing