Thread: Wrong Results from SP-GiST with Collations
Please see: > # create table t (a text collate "cs_CZ"); > CREATE TABLE > > # insert into t values ('c'), ('ch'); > INSERT 0 2 > > # select * from t where a < 'd'; > a > --- > c > (1 row) > > xx2_game=# create index on t using spgist (a); > CREATE INDEX > > xx2_game=# set enable_seqscan = 0; > SET > > xx2_game=# select * from t where a < 'd'; > a > ---- > c > ch > (2 rows) I assume we don't have any option than removing support for comparison operators with collations just like btree. The bug may apply to the other access methods.
On Tue, Apr 10, 2018 at 9:21 AM, Emre Hasegeli <emre@hasegeli.com> wrote: > I assume we don't have any option than removing support for comparison > operators with collations just like btree. I can't imagine how this could ever work, since a non C-collation string cannot be atomized into tokens like that. Considering each token successively during a comparison does not always produce the same answer as a straight comparison against the original string would. I believe that there are cases that won't work with *any* glibc or ICU collation, not just "cs_CZ". -- Peter Geoghegan
> I can't imagine how this could ever work, since a non C-collation > string cannot be atomized into tokens like that. Considering each > token successively during a comparison does not always produce the > same answer as a straight comparison against the original string > would. I think the only fix is to remove support for those operators from the operator class. The logic on match_special_index_operator() doesn't seem to need any change. Patch attached. As far as I understand, we cannot back-patch catalog changes. In this case, we can leave it broken and apply the documentation part.
Attachment
On Mon, Apr 16, 2018 at 8:16 AM, Emre Hasegeli <emre@hasegeli.com> wrote: > I think the only fix is to remove support for those operators from the > operator class. The logic on match_special_index_operator() doesn't > seem to need any change. Patch attached. As far as I understand, we > cannot back-patch catalog changes. In this case, we can leave it > broken and apply the documentation part. Has the operator class really been completely broken since SP-GiST was first introduced? I tend to doubt that. spg_text_inner_consistent() has the following code, which appears to acknowledge the problem with non-C collations: for (j = 0; j < in->nkeys; j++) { StrategyNumber strategy = in->scankeys[j].sk_strategy; text *inText; int inSize; int r; /* * If it's a collation-aware operator, but the collation is C, we * can treat it as non-collation-aware. With non-C collation we * need to traverse whole tree :-( so there's no point in making * any check here. (Note also that our reconstructed value may * well end with a partial multibyte character, so that applying * any encoding-sensitive test to it would be risky anyhow.) */ if (SPG_IS_COLLATION_AWARE_STRATEGY(strategy)) { if (collate_is_c) strategy -= SPG_STRATEGY_ADDITION; else continue; } Maybe this is a simple, relatively benign bug. Maybe this problem is due to commit 710d90da, which is only a couple of weeks old. Have you reproduced the problem on v10 of Postgres? -- Peter Geoghegan
> Maybe this is a simple, relatively benign bug. Maybe this problem is > due to commit 710d90da, which is only a couple of weeks old. Have you > reproduced the problem on v10 of Postgres? Yes, the initial example I posted was from 9.5.12 on Debian.
Peter Geoghegan <pg@bowt.ie> writes: > Has the operator class really been completely broken since SP-GiST was > first introduced? I tend to doubt that. spg_text_inner_consistent() > has the following code, which appears to acknowledge the problem with > non-C collations: You're on to something, but I think the bug is in spg_text_leaf_consistent, which thinks it can do collation-aware comparisons like this: r = varstr_cmp(fullValue, Min(queryLen, fullLen), VARDATA_ANY(query), Min(queryLen, fullLen), PG_GET_COLLATION()); That's got nothing to do with reality for non-C collations, and it seems rather pointless anyway, Why isn't this just r = varstr_cmp(fullValue, fullLen, VARDATA_ANY(query), queryLen, PG_GET_COLLATION()); and then the bit below about if (r == 0) { if (queryLen > fullLen) r = -1; else if (queryLen < fullLen) r = 1; } needs to move into the "non-collation-aware" branch. regards, tom lane
On Mon, Apr 16, 2018 at 11:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > You're on to something, but I think the bug is in > spg_text_leaf_consistent, which thinks it can do collation-aware > comparisons like this: > > r = varstr_cmp(fullValue, Min(queryLen, fullLen), > VARDATA_ANY(query), Min(queryLen, fullLen), > PG_GET_COLLATION()); Ah. Those arguments make that code completely broken. > and then the bit below about > > if (r == 0) > { > if (queryLen > fullLen) > r = -1; > else if (queryLen < fullLen) > r = 1; > } > > needs to move into the "non-collation-aware" branch. Right. Alternatively, you could actually call varstr_cmp() within the "non-collation-aware" branch. -- Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes: > On Mon, Apr 16, 2018 at 11:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> and then the bit below about ... >> needs to move into the "non-collation-aware" branch. > Right. Alternatively, you could actually call varstr_cmp() within the > "non-collation-aware" branch. True. This way saves a few cycles, but maybe it's not worth the extra code. I think the only case where you could really notice the difference is for an equality search operator, which might end up doing a lot more work in non-C collations (full-blown strcoll vs memcmp). regards, tom lane
On Mon, Apr 16, 2018 at 12:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Right. Alternatively, you could actually call varstr_cmp() within the >> "non-collation-aware" branch. > > True. This way saves a few cycles, but maybe it's not worth the extra > code. I think the only case where you could really notice the difference > is for an equality search operator, which might end up doing a lot more > work in non-C collations (full-blown strcoll vs memcmp). I don't have a full understanding of this particular problem, but it sounds to me that that would be a non-issue due to the equality fast-path added to varstr_cmp() several years ago. I microbenchmarked it quite extensively at the time, and concluded that it was all but free in cases where it didn't work out. -- Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes: > On Mon, Apr 16, 2018 at 12:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> True. This way saves a few cycles, but maybe it's not worth the extra >> code. I think the only case where you could really notice the difference >> is for an equality search operator, which might end up doing a lot more >> work in non-C collations (full-blown strcoll vs memcmp). > I don't have a full understanding of this particular problem, but it > sounds to me that that would be a non-issue due to the equality > fast-path added to varstr_cmp() several years ago. I microbenchmarked > it quite extensively at the time, and concluded that it was all but > free in cases where it didn't work out. I'm not following --- varstr_cmp has no way to know that we only care about equality vs inequality. Yes, it might give back an answer quickly when the strings are equal, but when they aren't, it has to decide which is greater. In this case we don't care, so long as the search operator is "=". regards, tom lane
On Mon, Apr 16, 2018 at 12:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I'm not following --- varstr_cmp has no way to know that we only > care about equality vs inequality. Yes, it might give back an > answer quickly when the strings are equal, but when they aren't, > it has to decide which is greater. In this case we don't care, > so long as the search operator is "=". That was the subtlety I was missing. I didn't understand what you meant, but thought that the fast-path might be relevant. Clearly not. -- Peter Geoghegan