Re: Hash join not finding which collation to use for string hashing - Mailing list pgsql-hackers

From Mark Dilger
Subject Re: Hash join not finding which collation to use for string hashing
Date
Msg-id C3E3E904-F9BF-4774-B400-D0ADBCFA57A5@enterprisedb.com
Whole thread Raw
In response to Re: Hash join not finding which collation to use for string hashing  (Mark Dilger <mark.dilger@enterprisedb.com>)
Responses Re: Hash join not finding which collation to use for string hashing
Re: Hash join not finding which collation to use for string hashing
List pgsql-hackers

>> Would it make sense to catch a qual with unassigned collation
>> somewhere in the planner, where the qual's operator family is
>> estatblished, by checking if the operator family behavior is sensitive
>> to collations?

I’m not sure how to do that.  pg_opfamily doesn’t seem to have a field for that.  Can you recommend how I would proceed
there?



There may be operators other than = and != that are worth thinking about, but for now I’m really only thinking about
thosetwo.  It is hard for me to see how we could ignore a collation mismatch for any other operator, but maybe I’m just
notthinking about it the right way. 

So, for = and !=, I’m looking at the definition of texteq, and it calls check_collation_set as one of the very first
thingsit does.  That’s where the error that annoys me comes out.  But I don’t think it really needs to be doing this.
Itshould first be determining if collation *matters*.  It can’t do that right now, because it gets that information
fromthis line: 

>     if (lc_collate_is_c(collid) ||
>         collid == DEFAULT_COLLATION_OID ||
>         pg_newlocale_from_collation(collid)->deterministic)

Which obviously won’t work if collid hasn’t been set.  So three approaches come to my mind:

1) Somewhere upstream from calling texteq, figure out that we don’t actually care about the collation stuff, because
boththe left and right side of the comparison use deterministic collations and the comparison we’re calling is
equality,and then pass down a dummy collation such as “C” even though that isn’t actually true. 

The problem with (1) that I see is that it could lead to the wrong collation being mentioned in error messages, though
Ihaven’t looked, and that it’s enough of a hack that it might make coding in this area harder in the future. 

2) Somewhere upstream from calling texteq, pass in a new boolean flag that specifies whether collation matters, and
extendtexteq to take an additional argument. 

This also seems very hacky to me, but for different reasons.

3) Extend the concept of collations to collation sets.  Right now, I’m only thinking about a collation set as having
twovalues, the lefthand and the righthand side, but maybe there are other cases like (Left, (Left,Right)) that get
builtup and need to work.  Anyway, at the point in the executor that the collations don’t match, instead of passing
NULLdown the line, pass in a collation set (Left, Right), and functions like texteq can see that they’re dealing with
twodifferent collations and decide if they can deal with that or if they need to throw an error. 

I bet if we went with (3), the error being thrown in the example I used to start this thread would go away, without
breakinganything else.  I’m going to go poke at that a bit, but I’d still appreciate any comments/concerns about my
analysis.

> According to Tom:
>
>> (BTW, before v12 the text '=' operator indeed did not care for collation,
>> so this example would've worked.  But the change in behavior is a
>> necessary consequence of having invented nondeterministic collations,
>> not a bug.)
>
> I’m still struggling with that, because the four collations I used in the example are all deterministic.  I totally
understandwhy having more than one collation matters if you ask that your data be in sorted order, as the system needs
toknow which ordering to use.  But for equality, I would think that deterministic collations are all interchangeable,
becausethey all agree on whether A = B, regardless of the collation defined on column A and/or on column B.  Maybe I’m
wrongabout that.  But that’s my reading of the definition of “deterministic collation” given in the docs: 
>
>> A deterministic collation uses deterministic comparisons, which means that it considers strings to be equal only if
theyconsist of the same byte sequence.  
>
> I’m reading that as “If and only if”, and maybe I’m wrong to do so.  Maybe that’s my error.  But assuming that part
isok, it would seem to be sufficient to know that the columns being joined use deterministic collations, and you
wouldn’tneed them to be the *same* collations, nor even remember which collations they were.  You’d just need
informationpassed down that collations can be ignored for this comparison, or that a built-in byte-for-byte equality
comparatorshould be used rather than the collation’s equality comparator, or some such solution. 

I’m starting to think that “consequence of having invented nondeterministic collations” in Tom’s message really should
read“consequence of having invented nondeterministic collations without reworking these other interfaces”, but once
again,I’m hoping to be corrected if I’ve gone off in the wrong direction here. 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






pgsql-hackers by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: Enabling B-Tree deduplication by default
Next
From: Tom Lane
Date:
Subject: Re: Hash join not finding which collation to use for string hashing