Re: Hash join not finding which collation to use for string hashing - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Hash join not finding which collation to use for string hashing
Date
Msg-id 12204.1580414561@sss.pgh.pa.us
Whole thread Raw
In response to Re: Hash join not finding which collation to use for string hashing  (Mark Dilger <mark.dilger@enterprisedb.com>)
Responses Re: Hash join not finding which collation to use for string hashing  (Mark Dilger <mark.dilger@enterprisedb.com>)
List pgsql-hackers
Mark Dilger <mark.dilger@enterprisedb.com> writes:
> According to Tom:
>> (BTW, before v12 the text '=' operator indeed did not care for collation,
>> so this example would've worked.  But the change in behavior is a
>> necessary consequence of having invented nondeterministic collations,
>> not a bug.)

> I’m still struggling with that, because the four collations I used in
> the example are all deterministic.  I totally understand why having more
> than one collation matters if you ask that your data be in sorted order,
> as the system needs to know which ordering to use.  But for equality, I
> would think that deterministic collations are all interchangeable,
> because they all agree on whether A = B, regardless of the collation
> defined on column A and/or on column B.  Maybe I’m wrong about that.

Well, you're not wrong, but you're assuming much finer distinctions
than the collation machinery actually makes (or than it'd be sane
to ask it to make, IMO).  We don't have a way to tell texteq that
"well, we don't know what collation to assign to this operation,
but it's okay to assume that it's deterministic".  Nor does the
parser have any way to know that texteq could be satisfied by
that knowledge --- if it doesn't even know whether texteq cares
about collation, how could it know that?

There are other issues here too.  Just because the query could
theoretically be implemented without reference to any specific
collation doesn't mean that that's a practical thing to do.
It'd be unclear for instance whether we can safely use indexes
that *do* have specific collations attached.  We'd also lose
the option to consider plans like mergejoins.

If the parser understood that a particular operator behaved
like text equality --- which it does not, and I guarantee you
I will shoot down any proposal to hard-wire a parser test for
that particular operator --- you could imagine assigning "C"
collation when we have an unresolvable combination of
deterministic collations for the inputs.  That dodges the
problem of not having any way to represent the situation.
But it's still got implementation issues, in that such a
collation choice probably won't match the collations of any
indexes for the input columns.

Another issue is that collations "bubble up" in the parse tree,
so sneaking in a collation that's not supposed to be there per
spec carries a risk of causing unexpected semantics further up.
I think we could get away with that for the particular case of
equality (which returns collation-less boolean), but this is
another thing that makes the case narrower and less useful.

In the end, TBH, I'm not finding your example compelling enough
to be worth putting in weird hacks for such cases.  If you're
joining columns of dissimilar collations, you're going to be
finding it necessary to specify what collation to use in a lot
of places ... so where's the value in making a weird special
case for equality?

            regards, tom lane



pgsql-hackers by date:

Previous
From: Mark Dilger
Date:
Subject: Re: Hash join not finding which collation to use for string hashing
Next
From: Robert Haas
Date:
Subject: Re: Hash join not finding which collation to use for string hashing