Thread: Hash join not finding which collation to use for string hashing
While reviewing the partition-wise join patch, I ran into an issue that exists in master, so rather than responding to thatpatch, I’m starting this new thread. I noticed that this seems similar to the problem that was supposed to have been fixed in the "Re: COLLATE: Hash partitionvs UPDATE” thread. As such, I’ve included Tom and Amit in the CC list. Notice the "ERROR: could not determine which collation to use for string hashing” The following is extracted from the output from the test: > CREATE TABLE raw_data (a text); > INSERT INTO raw_data (a) VALUES ('Türkiye'), > ('TÜRKIYE'), > ('bıt'), > ('BIT'), > ('äbç'), > ('ÄBÇ'), > ('aaá'), > ('coté'), > ('Götz'), > ('ὀδυσσεύς'), > ('ὈΔΥΣΣΕΎΣ'), > ('を読み取り用'), > ('にオープンできませんでした'); > -- Create unpartitioned tables for test > CREATE TABLE alpha (a TEXT COLLATE "ja_JP", b TEXT COLLATE "sv_SE"); > CREATE TABLE beta (a TEXT COLLATE "tr_TR", b TEXT COLLATE "en_US"); > INSERT INTO alpha (SELECT a, a FROM raw_data); > INSERT INTO beta (SELECT a, a FROM raw_data); > ANALYZE alpha; > ANALYZE beta; > EXPLAIN (COSTS OFF) > SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς'); > QUERY PLAN > ------------------------------------------------------------ > Hash Join > Hash Cond: ((t2.a)::text = (t1.a)::text) > -> Seq Scan on beta t2 > -> Hash > -> Seq Scan on alpha t1 > Filter: (a = ANY ('{äbç,ὀδυσσεύς}'::text[])) > (6 rows) > > SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς'); > ERROR: could not determine which collation to use for string hashing > HINT: Use the COLLATE clause to set the collation explicitly. > — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
Mark Dilger <mark.dilger@enterprisedb.com> writes: > While reviewing the partition-wise join patch, I ran into an issue that exists in master, so rather than responding tothat patch, I’m starting this new thread. > I noticed that this seems similar to the problem that was supposed to have been fixed in the "Re: COLLATE: Hash partitionvs UPDATE” thread. As such, I’ve included Tom and Amit in the CC list. Hm, I don't see any bug here. You're asking it to join >> CREATE TABLE alpha (a TEXT COLLATE "ja_JP", b TEXT COLLATE "sv_SE"); >> CREATE TABLE beta (a TEXT COLLATE "tr_TR", b TEXT COLLATE "en_US"); >> SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς'); so t1.a and t2.a have different collations, and the system can't resolve which to use for the comparison. Now, I'd be the first to agree that this error could be reported better. The parser knows that it couldn't resolve a collation for t1.a = t2.a, but what it does *not* know is whether the '=' operator cares for collation. Throwing an error when the operator wouldn't care at runtime isn't going to make many people happy. On the other hand, when the operator finally does run and can't get a collation, all it knows is that it didn't get a collation, not why. So we can't produce an error message as specific as "ja_JP and tr_TR collations conflict". Now that the collations feature has settled in, it'd be nice to go back and see if we can't improve that somehow. Not sure how. (BTW, before v12 the text '=' operator indeed did not care for collation, so this example would've worked. But the change in behavior is a necessary consequence of having invented nondeterministic collations, not a bug.) regards, tom lane
> On Jan 28, 2020, at 6:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Mark Dilger <mark.dilger@enterprisedb.com> writes: >> While reviewing the partition-wise join patch, I ran into an issue that exists in master, so rather than responding tothat patch, I’m starting this new thread. >> I noticed that this seems similar to the problem that was supposed to have been fixed in the "Re: COLLATE: Hash partitionvs UPDATE” thread. As such, I’ve included Tom and Amit in the CC list. > > Hm, I don't see any bug here. You're asking it to join > >>> CREATE TABLE alpha (a TEXT COLLATE "ja_JP", b TEXT COLLATE "sv_SE"); >>> CREATE TABLE beta (a TEXT COLLATE "tr_TR", b TEXT COLLATE "en_US"); > >>> SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς'); > > so t1.a and t2.a have different collations, and the system can't resolve > which to use for the comparison. > > Now, I'd be the first to agree that this error could be reported better. > The parser knows that it couldn't resolve a collation for t1.a = t2.a, but > what it does *not* know is whether the '=' operator cares for collation. > Throwing an error when the operator wouldn't care at runtime isn't going > to make many people happy. On the other hand, when the operator finally > does run and can't get a collation, all it knows is that it didn't get a > collation, not why. So we can't produce an error message as specific as > "ja_JP and tr_TR collations conflict". > > Now that the collations feature has settled in, it'd be nice to go back > and see if we can't improve that somehow. Not sure how. > > (BTW, before v12 the text '=' operator indeed did not care for collation, > so this example would've worked. But the change in behavior is a > necessary consequence of having invented nondeterministic collations, > not a bug.) I contemplated that for a while before submitting the report. I agree that for strings that are not binary equal, some collationsmight say the two strings are equal, and other collations may say that they are not. But when does any collationsay that a string is not equal to itself? All the strings in these columns were loaded from the same source table,and they should always equal themselves, so the only problem I am aware of is if some of them equal others of themunder one of the collations in question, where the other collation doesn’t think so. I’m pretty sure that does not existin this concrete example. I guess I’m arguing that the system is giving up too soon, saying, “In theory there might be values I don’t know how to compare,so I’m going to give up now and not look”. I think what is happening here is that the system thinks, “Hey, I can use a hash join for this”, and then later realizes,“Oh, no, I can’t” and instead of falling back to something other than hash join, it gives up. Is there some more fundamental reason this query couldn’t correctly be completed? I don’t mind being enlightened about thepart that I’m missing. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Jan 28, 2020, at 7:38 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote: > > > >> On Jan 28, 2020, at 6:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> >> Mark Dilger <mark.dilger@enterprisedb.com> writes: >>> While reviewing the partition-wise join patch, I ran into an issue that exists in master, so rather than responding tothat patch, I’m starting this new thread. >>> I noticed that this seems similar to the problem that was supposed to have been fixed in the "Re: COLLATE: Hash partitionvs UPDATE” thread. As such, I’ve included Tom and Amit in the CC list. >> >> Hm, I don't see any bug here. You're asking it to join >> >>>> CREATE TABLE alpha (a TEXT COLLATE "ja_JP", b TEXT COLLATE "sv_SE"); >>>> CREATE TABLE beta (a TEXT COLLATE "tr_TR", b TEXT COLLATE "en_US"); >> >>>> SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς'); >> >> so t1.a and t2.a have different collations, and the system can't resolve >> which to use for the comparison. >> >> Now, I'd be the first to agree that this error could be reported better. >> The parser knows that it couldn't resolve a collation for t1.a = t2.a, but >> what it does *not* know is whether the '=' operator cares for collation. >> Throwing an error when the operator wouldn't care at runtime isn't going >> to make many people happy. On the other hand, when the operator finally >> does run and can't get a collation, all it knows is that it didn't get a >> collation, not why. So we can't produce an error message as specific as >> "ja_JP and tr_TR collations conflict". >> >> Now that the collations feature has settled in, it'd be nice to go back >> and see if we can't improve that somehow. Not sure how. >> >> (BTW, before v12 the text '=' operator indeed did not care for collation, >> so this example would've worked. But the change in behavior is a >> necessary consequence of having invented nondeterministic collations, >> not a bug.) > > I contemplated that for a while before submitting the report. I agree that for strings that are not binary equal, somecollations might say the two strings are equal, and other collations may say that they are not. But when does any collationsay that a string is not equal to itself? All the strings in these columns were loaded from the same source table,and they should always equal themselves, so the only problem I am aware of is if some of them equal others of themunder one of the collations in question, where the other collation doesn’t think so. I’m pretty sure that does not existin this concrete example. > > I guess I’m arguing that the system is giving up too soon, saying, “In theory there might be values I don’t know how tocompare, so I’m going to give up now and not look”. > > I think what is happening here is that the system thinks, “Hey, I can use a hash join for this”, and then later realizes,“Oh, no, I can’t” and instead of falling back to something other than hash join, it gives up. > > Is there some more fundamental reason this query couldn’t correctly be completed? I don’t mind being enlightened aboutthe part that I’m missing. If the answer here is just that you’d rather it always fail at planning time because that’s more deterministic than havingit sometimes succeed and sometimes fail at runtime depending on which data has been loaded, ok, I can understand that. If so, then let’s put this error string into the docs, because right now, if you google site:postgresql.org "could not determine which collation to use for string hashing” you don’t get anything from the docs telling you that this is an expected outcome. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi Mark, On Wed, Jan 29, 2020 at 1:03 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > > On Jan 28, 2020, at 7:38 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote: > >> Mark Dilger <mark.dilger@enterprisedb.com> writes: > >>> While reviewing the partition-wise join patch, I ran into an issue that exists in master, so rather than respondingto that patch, I’m starting this new thread. > >>> I noticed that this seems similar to the problem that was supposed to have been fixed in the "Re: COLLATE: Hash partitionvs UPDATE” thread. As such, I’ve included Tom and Amit in the CC list. Just to clarify, we only intended in the quoted thread to plug relevant holes of the *partitioning* code, which IIRC was more straightforward to do than appears to be the case here. > If the answer here is just that you’d rather it always fail at planning time because that’s more deterministic than havingit sometimes succeed and sometimes fail at runtime depending on which data has been loaded, ok, I can understand that. If so, then let’s put this error string into the docs, because right now, if you google > > site:postgresql.org "could not determine which collation to use for string hashing” > > you don’t get anything from the docs telling you that this is an expected outcome. You may have noticed that it's not only hash join that bails out: EXPLAIN (COSTS OFF) SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς'); QUERY PLAN ------------------------------------------------------------ Hash Join Hash Cond: ((t2.a)::text = (t1.a)::text) -> Seq Scan on beta t2 -> Hash -> Seq Scan on alpha t1 Filter: (a = ANY ('{äbç,ὀδυσσεύς}'::text[])) (6 rows) SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς'); ERROR: could not determine which collation to use for string hashing HINT: Use the COLLATE clause to set the collation explicitly. SET enable_hashjoin TO off; -- error occurs partway through ExecInitMergeJoin(), so EXPLAIN can't finish EXPLAIN (COSTS OFF) SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς'); ERROR: could not determine which collation to use for string comparison HINT: Use the COLLATE clause to set the collation explicitly. SET enable_mergejoin TO off; EXPLAIN (COSTS OFF) SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς'); QUERY PLAN ------------------------------------------------------------ Nested Loop Join Filter: ((t1.a)::text = (t2.a)::text) -> Seq Scan on beta t2 -> Materialize -> Seq Scan on alpha t1 Filter: (a = ANY ('{äbç,ὀδυσσεύς}'::text[])) (6 rows) SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς'); ERROR: could not determine which collation to use for string comparison HINT: Use the COLLATE clause to set the collation explicitly. With PG 11, I can see that hash join and nestloop join work. But with PG 12, this join can't possible work without an explicit COLLATE clause. So it would be nice if we can report a more specific error much sooner, possibly with some parser context, given that we now know for sure that a join qual without a collation assigned will not work at all. IOW, maybe we should aim for making the run-time collation errors to be of "won't happen" category as much as possible. Tom said: > >> Now, I'd be the first to agree that this error could be reported better. > >> The parser knows that it couldn't resolve a collation for t1.a = t2.a, but > >> what it does *not* know is whether the '=' operator cares for collation. > >> Throwing an error when the operator wouldn't care at runtime isn't going > >> to make many people happy. On the other hand, when the operator finally > >> does run and can't get a collation, all it knows is that it didn't get a > >> collation, not why. So we can't produce an error message as specific as > >> "ja_JP and tr_TR collations conflict". > >> > >> Now that the collations feature has settled in, it'd be nice to go back > >> and see if we can't improve that somehow. Not sure how. Would it make sense to catch a qual with unassigned collation somewhere in the planner, where the qual's operator family is estatblished, by checking if the operator family behavior is sensitive to collations? Thanks, Amit
> On Jan 29, 2020, at 10:14 PM, Amit Langote <amitlangote09@gmail.com> wrote: > > > SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a) > WHERE t1.a IN ('äbç', 'ὀδυσσεύς'); > ERROR: could not determine which collation to use for string comparison > HINT: Use the COLLATE clause to set the collation explicitly. > > With PG 11, I can see that hash join and nestloop join work. But with > PG 12, this join can't possible work without an explicit COLLATE > clause. So it would be nice if we can report a more specific error > much sooner, possibly with some parser context, given that we now know > for sure that a join qual without a collation assigned will not work > at all. IOW, maybe we should aim for making the run-time collation > errors to be of "won't happen" category as much as possible. > > Tom said: >>>> Now, I'd be the first to agree that this error could be reported better. >>>> The parser knows that it couldn't resolve a collation for t1.a = t2.a, but >>>> what it does *not* know is whether the '=' operator cares for collation. >>>> Throwing an error when the operator wouldn't care at runtime isn't going >>>> to make many people happy. On the other hand, when the operator finally >>>> does run and can't get a collation, all it knows is that it didn't get a >>>> collation, not why. So we can't produce an error message as specific as >>>> "ja_JP and tr_TR collations conflict". >>>> >>>> Now that the collations feature has settled in, it'd be nice to go back >>>> and see if we can't improve that somehow. Not sure how. > > Would it make sense to catch a qual with unassigned collation > somewhere in the planner, where the qual's operator family is > estatblished, by checking if the operator family behavior is sensitive > to collations? Hi Amit, I appreciate your attention to my question, but I’m not ready to delve into possible fixes, as I still don’t entirelyunderstand the problem. According to Tom: > (BTW, before v12 the text '=' operator indeed did not care for collation, > so this example would've worked. But the change in behavior is a > necessary consequence of having invented nondeterministic collations, > not a bug.) I’m still struggling with that, because the four collations I used in the example are all deterministic. I totally understandwhy having more than one collation matters if you ask that your data be in sorted order, as the system needs toknow which ordering to use. But for equality, I would think that deterministic collations are all interchangeable, becausethey all agree on whether A = B, regardless of the collation defined on column A and/or on column B. Maybe I’m wrongabout that. But that’s my reading of the definition of “deterministic collation” given in the docs: > A deterministic collation uses deterministic comparisons, which means that it considers strings to be equal only if theyconsist of the same byte sequence. I’m reading that as “If and only if”, and maybe I’m wrong to do so. Maybe that’s my error. But assuming that part is ok,it would seem to be sufficient to know that the columns being joined use deterministic collations, and you wouldn’t needthem to be the *same* collations, nor even remember which collations they were. You’d just need information passed downthat collations can be ignored for this comparison, or that a built-in byte-for-byte equality comparator should be usedrather than the collation’s equality comparator, or some such solution. I’m guessing I’m wrong about at least one of these things, and I’m hoping somebody enlightens me. Thanks so much in advance, — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
>> Would it make sense to catch a qual with unassigned collation >> somewhere in the planner, where the qual's operator family is >> estatblished, by checking if the operator family behavior is sensitive >> to collations? I’m not sure how to do that. pg_opfamily doesn’t seem to have a field for that. Can you recommend how I would proceed there? There may be operators other than = and != that are worth thinking about, but for now I’m really only thinking about thosetwo. It is hard for me to see how we could ignore a collation mismatch for any other operator, but maybe I’m just notthinking about it the right way. So, for = and !=, I’m looking at the definition of texteq, and it calls check_collation_set as one of the very first thingsit does. That’s where the error that annoys me comes out. But I don’t think it really needs to be doing this. Itshould first be determining if collation *matters*. It can’t do that right now, because it gets that information fromthis line: > if (lc_collate_is_c(collid) || > collid == DEFAULT_COLLATION_OID || > pg_newlocale_from_collation(collid)->deterministic) Which obviously won’t work if collid hasn’t been set. So three approaches come to my mind: 1) Somewhere upstream from calling texteq, figure out that we don’t actually care about the collation stuff, because boththe left and right side of the comparison use deterministic collations and the comparison we’re calling is equality,and then pass down a dummy collation such as “C” even though that isn’t actually true. The problem with (1) that I see is that it could lead to the wrong collation being mentioned in error messages, though Ihaven’t looked, and that it’s enough of a hack that it might make coding in this area harder in the future. 2) Somewhere upstream from calling texteq, pass in a new boolean flag that specifies whether collation matters, and extendtexteq to take an additional argument. This also seems very hacky to me, but for different reasons. 3) Extend the concept of collations to collation sets. Right now, I’m only thinking about a collation set as having twovalues, the lefthand and the righthand side, but maybe there are other cases like (Left, (Left,Right)) that get builtup and need to work. Anyway, at the point in the executor that the collations don’t match, instead of passing NULLdown the line, pass in a collation set (Left, Right), and functions like texteq can see that they’re dealing with twodifferent collations and decide if they can deal with that or if they need to throw an error. I bet if we went with (3), the error being thrown in the example I used to start this thread would go away, without breakinganything else. I’m going to go poke at that a bit, but I’d still appreciate any comments/concerns about my analysis. > According to Tom: > >> (BTW, before v12 the text '=' operator indeed did not care for collation, >> so this example would've worked. But the change in behavior is a >> necessary consequence of having invented nondeterministic collations, >> not a bug.) > > I’m still struggling with that, because the four collations I used in the example are all deterministic. I totally understandwhy having more than one collation matters if you ask that your data be in sorted order, as the system needs toknow which ordering to use. But for equality, I would think that deterministic collations are all interchangeable, becausethey all agree on whether A = B, regardless of the collation defined on column A and/or on column B. Maybe I’m wrongabout that. But that’s my reading of the definition of “deterministic collation” given in the docs: > >> A deterministic collation uses deterministic comparisons, which means that it considers strings to be equal only if theyconsist of the same byte sequence. > > I’m reading that as “If and only if”, and maybe I’m wrong to do so. Maybe that’s my error. But assuming that part isok, it would seem to be sufficient to know that the columns being joined use deterministic collations, and you wouldn’tneed them to be the *same* collations, nor even remember which collations they were. You’d just need informationpassed down that collations can be ignored for this comparison, or that a built-in byte-for-byte equality comparatorshould be used rather than the collation’s equality comparator, or some such solution. I’m starting to think that “consequence of having invented nondeterministic collations” in Tom’s message really should read“consequence of having invented nondeterministic collations without reworking these other interfaces”, but once again,I’m hoping to be corrected if I’ve gone off in the wrong direction here. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Mark Dilger <mark.dilger@enterprisedb.com> writes: > According to Tom: >> (BTW, before v12 the text '=' operator indeed did not care for collation, >> so this example would've worked. But the change in behavior is a >> necessary consequence of having invented nondeterministic collations, >> not a bug.) > I’m still struggling with that, because the four collations I used in > the example are all deterministic. I totally understand why having more > than one collation matters if you ask that your data be in sorted order, > as the system needs to know which ordering to use. But for equality, I > would think that deterministic collations are all interchangeable, > because they all agree on whether A = B, regardless of the collation > defined on column A and/or on column B. Maybe I’m wrong about that. Well, you're not wrong, but you're assuming much finer distinctions than the collation machinery actually makes (or than it'd be sane to ask it to make, IMO). We don't have a way to tell texteq that "well, we don't know what collation to assign to this operation, but it's okay to assume that it's deterministic". Nor does the parser have any way to know that texteq could be satisfied by that knowledge --- if it doesn't even know whether texteq cares about collation, how could it know that? There are other issues here too. Just because the query could theoretically be implemented without reference to any specific collation doesn't mean that that's a practical thing to do. It'd be unclear for instance whether we can safely use indexes that *do* have specific collations attached. We'd also lose the option to consider plans like mergejoins. If the parser understood that a particular operator behaved like text equality --- which it does not, and I guarantee you I will shoot down any proposal to hard-wire a parser test for that particular operator --- you could imagine assigning "C" collation when we have an unresolvable combination of deterministic collations for the inputs. That dodges the problem of not having any way to represent the situation. But it's still got implementation issues, in that such a collation choice probably won't match the collations of any indexes for the input columns. Another issue is that collations "bubble up" in the parse tree, so sneaking in a collation that's not supposed to be there per spec carries a risk of causing unexpected semantics further up. I think we could get away with that for the particular case of equality (which returns collation-less boolean), but this is another thing that makes the case narrower and less useful. In the end, TBH, I'm not finding your example compelling enough to be worth putting in weird hacks for such cases. If you're joining columns of dissimilar collations, you're going to be finding it necessary to specify what collation to use in a lot of places ... so where's the value in making a weird special case for equality? regards, tom lane
On Thu, Jan 30, 2020 at 2:44 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > 3) Extend the concept of collations to collation sets. Right now, I’m only thinking about a collation set as having twovalues, the lefthand and the righthand side, but maybe there are other cases like (Left, (Left,Right)) that get builtup and need to work. Anyway, at the point in the executor that the collations don’t match, instead of passing NULLdown the line, pass in a collation set (Left, Right), and functions like texteq can see that they’re dealing with twodifferent collations and decide if they can deal with that or if they need to throw an error. > > I bet if we went with (3), the error being thrown in the example I used to start this thread would go away, without breakinganything else. I’m going to go poke at that a bit, but I’d still appreciate any comments/concerns about my analysis. I assume that what would have to happen to implement this is that an SQL-callable function would be passed more than one collation OID, perhaps one per argument or something like that. Notice, however, that this would require changing the way that functions get called. See the DirectFunctionCall{1,2,3,...}Coll() and FunctionCall{0,1,2,3,...}Coll() and the definition of FunctionCallInfoBaseData -- there's only one spot for an OID available right now. Allowing for more would likely have a noticeable impact on the cost of calling SQL-callable functions, and that's already expensive enough that people have been unhappy about it. It seems unlikely that it would be worth incurring more overhead here for every query all the time just to make this case work. It is, perhaps, a little strange that the only two choices for an operator are "cares about collation" and "doesn't," and I somehow feel like there ought to be a way to do better. But I don't know what it is. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Jan 30, 2020, at 12:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Mark Dilger <mark.dilger@enterprisedb.com> writes: >> According to Tom: >>> (BTW, before v12 the text '=' operator indeed did not care for collation, >>> so this example would've worked. But the change in behavior is a >>> necessary consequence of having invented nondeterministic collations, >>> not a bug.) > >> I’m still struggling with that, because the four collations I used in >> the example are all deterministic. I totally understand why having more >> than one collation matters if you ask that your data be in sorted order, >> as the system needs to know which ordering to use. But for equality, I >> would think that deterministic collations are all interchangeable, >> because they all agree on whether A = B, regardless of the collation >> defined on column A and/or on column B. Maybe I’m wrong about that. > > Well, you're not wrong, but you're assuming much finer distinctions > than the collation machinery actually makes (or than it'd be sane > to ask it to make, IMO). We don't have a way to tell texteq that > "well, we don't know what collation to assign to this operation, > but it's okay to assume that it's deterministic". Nor does the > parser have any way to know that texteq could be satisfied by > that knowledge --- if it doesn't even know whether texteq cares > about collation, how could it know that? I agree. Having this in the parser seems really weird and unwholesome. > There are other issues here too. Just because the query could > theoretically be implemented without reference to any specific > collation doesn't mean that that's a practical thing to do. > It'd be unclear for instance whether we can safely use indexes > that *do* have specific collations attached. We'd also lose > the option to consider plans like mergejoins. > > If the parser understood that a particular operator behaved > like text equality --- which it does not, and I guarantee you > I will shoot down any proposal to hard-wire a parser test for > that particular operator --- you could imagine assigning "C" > collation when we have an unresolvable combination of > deterministic collations for the inputs. That dodges the > problem of not having any way to represent the situation. > But it's still got implementation issues, in that such a > collation choice probably won't match the collations of any > indexes for the input columns. Yeah, I disclaimed that idea in a subsequent email, but if you’re responding to my emails in the order that you receive them(which is totally reasonable), then you aren’t to know that yet. > > Another issue is that collations "bubble up" in the parse tree, > so sneaking in a collation that's not supposed to be there per > spec carries a risk of causing unexpected semantics further up. > I think we could get away with that for the particular case of > equality (which returns collation-less boolean), but this is > another thing that makes the case narrower and less useful. I was wondering if bubbling up (LeftCollation,RightCollation) would be ok. There are likely cases that can’t make use ofthat, but those places would just throw the same sort of error that they’re currently throwing, except they’d have a moreuseful error message because it would include which collations were mismatched. > In the end, TBH, I'm not finding your example compelling enough > to be worth putting in weird hacks for such cases. If you're > joining columns of dissimilar collations, you're going to be > finding it necessary to specify what collation to use in a lot > of places ... so where's the value in making a weird special > case for equality? I agree with your position against weird hacks. If the only way to do this is a weird hack, then forget about it. If I’m not putting upon your time too much, could you respond to my other email in this thread as to whether it sounds anybetter? — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Mark Dilger <mark.dilger@enterprisedb.com> writes: > Would it make sense to catch a qual with unassigned collation > somewhere in the planner, where the qual's operator family is > estatblished, by checking if the operator family behavior is sensitive > to collations? > I’m not sure how to do that. pg_opfamily doesn’t seem to have a field for that. Can you recommend how I would proceedthere? There's no such information attached to opfamilies, which is more or less forced by the fact that individual operators don't expose it either. There's not much hope of making that better without incompatible changes in the requirements for extensions to define operators and/or operator families. > So, for = and !=, I’m looking at the definition of texteq, and it calls check_collation_set as one of the very first thingsit does. That’s where the error that annoys me comes out. But I don’t think it really needs to be doing this. Itshould first be determining if collation *matters*. But of course it matters. How do you know whether the operation is deterministic if you don't know the collation? > 3) Extend the concept of collations to collation sets. Right now, I’m only thinking about a collation set as having twovalues, the lefthand and the righthand side, but maybe there are other cases like (Left, (Left,Right)) that get builtup and need to work. Anyway, at the point in the executor that the collations don’t match, instead of passing NULLdown the line, pass in a collation set (Left, Right), and functions like texteq can see that they’re dealing with twodifferent collations and decide if they can deal with that or if they need to throw an error. Maybe this could work. I think it would get messy when bubbling up collations, but as long as you're talking about "sets" not "pairs" it might be possible to postpone collation resolution. To me, though, the main advantage of this is that we could throw a more explicit error like "collations "ja_JP" and "tr_TR" cannot be unified", since that information would still be there at runtime. I'm still pretty dubious that having texteq special-case the situation where the collations are different but all deterministic is a reasonable thing to do. One practical problem is that postponing that work to runtime could be a huge performance hit, because you'd have to do it over again on each call of the operator. I suppose some caching might be possible. Another issue is that you're still putting far too much emphasis on the fact that a hash-join plan manages to avoid this error, and ignoring the problem that a lot of other plans for the same query will not avoid it. What if the planner had chosen a merge-join, for instance? How useful is it to allow the join if things still break the moment you add an ORDER BY? regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > I assume that what would have to happen to implement this is that an > SQL-callable function would be passed more than one collation OID, > perhaps one per argument or something like that. Notice, however, that > this would require changing the way that functions get called. See the > DirectFunctionCall{1,2,3,...}Coll() and > FunctionCall{0,1,2,3,...}Coll() and the definition of > FunctionCallInfoBaseData -- there's only one spot for an OID available > right now. Allowing for more would likely have a noticeable impact on > the cost of calling SQL-callable functions, and that's already > expensive enough that people have been unhappy about it. It seems > unlikely that it would be worth incurring more overhead here for every > query all the time just to make this case work. The implementation I was visualizing was replacing, eg, FuncExpr.inputcollid with an OID List, and then teaching PG_GET_COLLATION to throw an error if the list is longer than one element. I agree that the performance implications of that would be pretty troublesome, though. In the end, it seems like the only solution that would be remotely practical from a performance standpoint is to redefine things so that collation-sensitive functions have to be labeled as such in pg_proc, and then we can have the parser throw the appropriate error if it can't resolve an input collation for such a function. Perhaps the backwards-compatibility hit wouldn't be as bad as it first seems, since the whole thing can be ignored for functions that haven't got at least one collatable input, and most of those would likely be all right with a default assumption that they are collation sensitive. Or maybe better, we could make the default assumption be that they aren't sensitive, with the same error still being thrown at runtime if they are, so that extensions have to take positive action to get the better error behavior but if they don't then things are no worse than today. Mark, obviously, would then lobby for the pg_proc marking to include one state that identifies functions that only care about collation when it's nondeterministic. But I'm still not very sure how that would work as soon as you look anyplace except at what texteq() itself would do. The questions of whether such a query matches a given index, or could be implemented via mergejoin, etc, remain. regards, tom lane
> On Jan 30, 2020, at 12:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Mark Dilger <mark.dilger@enterprisedb.com> writes: >> Would it make sense to catch a qual with unassigned collation >> somewhere in the planner, where the qual's operator family is >> estatblished, by checking if the operator family behavior is sensitive >> to collations? > >> I’m not sure how to do that. pg_opfamily doesn’t seem to have a field for that. Can you recommend how I would proceedthere? > > There's no such information attached to opfamilies, which is more or less > forced by the fact that individual operators don't expose it either. > There's not much hope of making that better without incompatible changes > in the requirements for extensions to define operators and/or operator > families. Thanks, Tom, for confirming this. Given the excellent explanations you and Robert have given, I think I’m retracting this whole idea and accepting your positionsthat it’s not worth it. For the archives, I’m still going to respond to the rest of what you say: >> So, for = and !=, I’m looking at the definition of texteq, and it calls check_collation_set as one of the very first thingsit does. That’s where the error that annoys me comes out. But I don’t think it really needs to be doing this. Itshould first be determining if collation *matters*. > > But of course it matters. How do you know whether the operation is > deterministic if you don't know the collation? > >> 3) Extend the concept of collations to collation sets. Right now, I’m only thinking about a collation set as having twovalues, the lefthand and the righthand side, but maybe there are other cases like (Left, (Left,Right)) that get builtup and need to work. Anyway, at the point in the executor that the collations don’t match, instead of passing NULLdown the line, pass in a collation set (Left, Right), and functions like texteq can see that they’re dealing with twodifferent collations and decide if they can deal with that or if they need to throw an error. > > Maybe this could work. I think it would get messy when bubbling up > collations, but as long as you're talking about "sets" not "pairs" > it might be possible to postpone collation resolution. > > To me, though, the main advantage of this is that we could throw a > more explicit error like "collations "ja_JP" and "tr_TR" cannot be > unified", since that information would still be there at runtime. > I'm still pretty dubious that having texteq special-case the situation > where the collations are different but all deterministic is a reasonable > thing to do. On my mac, when I run “SELECT * FROM pg_collation”, every one of the 271 rows I get back have collisdeterministic true. Iknow that which collations you get on a system is variable, so I’m not saying that nobody has nondeterministic collations,but it seems common enough that mismatched collations will both be deterministic. That’s the common case, notsome weird edge case. So the issue here seems to be whether equality should get different treatment from other operators, and I obviously am arguingthat it should, though you and Robert have both made really good points against that position. > > One practical problem is that postponing that work to runtime could be > a huge performance hit, because you'd have to do it over again on each > call of the operator. I suppose some caching might be possible. Yes, Robert mentioned performance implications, too. > Another issue is that you're still putting far too much emphasis on > the fact that a hash-join plan manages to avoid this error, and ignoring > the problem that a lot of other plans for the same query will not avoid > it. What if the planner had chosen a merge-join, for instance? You’re looking at the problem from the point of view of how postgres is currently and historically implemented, and seeingthat this problem is hard. I was looking at it more from the perspective of a user who gets the error message andthinks, “this is stupid, the query is refusing to run for want of a collation being specified, but I can clearly see thatit doesn’t actually need one.” I think that same reaction from the user would happen if the planner chose a merge-join. The user would just say, “gee, what a stupid planner, why did it choose a merge join for this when the lack ofa collation clause clearly indicates that it should have limited itself to something that only needs equality comparison.” I’m not calling the planner stupid, nor the system generally, but I know that people get frustrated with systems that haveunintuitive limitations like this when they don’t know the internals of the system that give rise to the limitations. > How > useful is it to allow the join if things still break the moment you > add an ORDER BY? I think that’s apples-to-oranges. If you ask the system to order the data, and you’ve got ambiguity about which orderingyou mean, then of course it can’t continue until you tell it which collation you want. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jan 31, 2020 at 6:15 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > > On Jan 30, 2020, at 12:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Mark Dilger <mark.dilger@enterprisedb.com> writes: > >> Would it make sense to catch a qual with unassigned collation > >> somewhere in the planner, where the qual's operator family is > >> estatblished, by checking if the operator family behavior is sensitive > >> to collations? > > > >> I’m not sure how to do that. pg_opfamily doesn’t seem to have a field for that. Can you recommend how I would proceedthere? > > > > There's no such information attached to opfamilies, which is more or less > > forced by the fact that individual operators don't expose it either. > > There's not much hope of making that better without incompatible changes > > in the requirements for extensions to define operators and/or operator > > families. > > Thanks, Tom, for confirming this. Just for the record, I will explain why I brought up doing anything with operator families at all. What I was really imagining is putting a hard-coded check somewhere in the middle of equivalence processing to see if a given qual's operator would be sensitive to collation based *only* on whether it belongs to a text operator family, such as TEXT_BTREE_FAM_OID, whereas the qual expression's inputcollid is 0 (parser failed to resolve collation conflict among its arguments) and erroring out if so. If we do that, maybe we won't need check_collation_set() that's used in various text operators. Also, erroring out sooner might allow us to produce more specific error message, which as I understand it, would help with one of the Mark's complaints that the error message is too ambiguous due to emitted as such a low layer. I thought of the idea after running into a recent commit relevant to non-deterministic collations: commit 2810396312664bdb941e549df7dfa75218d73a1c Author: Tom Lane <tgl@sss.pgh.pa.us> Date: Sat Sep 21 16:29:17 2019 -0400 Fix up handling of nondeterministic collations with pattern_ops opclasses. text_pattern_ops and its siblings can't be used with nondeterministic collations, because they use the text_eq operator which will not behave as bitwise equality if applied with a nondeterministic collation. The initial implementation of that restriction was to insert a run-time test in the related comparison functions, but that is inefficient, may throw misleading errors, and will throw errors in some cases that would work. It seems sufficient to just prevent the combination during CREATE INDEX, so do that instead. Lacking any better way to identify the opclasses involved, we need to hard-wire tests for them, which requires hand-assigned values for their OIDs, which forces a catversion bump because they previously had OIDs that would be assigned automatically. That's slightly annoying in the v12 branch, but fortunately we're not at rc1 yet, so just do it. Discussion: https://postgr.es/m/22566.1568675619@sss.pgh.pa.us IIUC, the above commit removes a check_collation_set() call from a operator class comparison function in favor of ensuring that an index using that operator class can only be defined with a deterministic collation in the first place. But as the above commit is purportedly only a stop-gap solution due to lacking operator class infrastructure to consider collation in operator semantics, maybe we shouldn't spread such a hack in other places. Thanks, Amit