Re: sortsupport for text - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: sortsupport for text |
Date | |
Msg-id | CAEYLb_Uw4_Z0Vfb8+VR1+iM2zJ-c7YCV=sQRT1J4CXK+sAkR+w@mail.gmail.com Whole thread Raw |
In response to | Re: sortsupport for text (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: sortsupport for text
|
List | pgsql-hackers |
On 18 June 2012 00:38, Tom Lane <tgl@sss.pgh.pa.us> wrote: > The only reason we test "a = b" and not "a < b || a > b" > is that the latter is at least twice as expensive to evaluate. Perhaps I've been unclear. I meant to write "(!(a < b) && !(b < a))", and not "(a < b && b < a)". The former is not tautological when the trichotomy law holds, but the latter is. My mistake - I was a little tired when I wrote that (though you seemed to know what I meant). You can sort things just fine if you only have a less than comparator than returns a boolean, which is what I sought to illustrate with that example: There is no requirement that the comparator indicate anything more than whether one element is less than the other. In particular, it need not tell the sort function that two elements are equal, though the sort function will still (almost invariably in practice) put equal elements beside each other. Consider std::set, a highly-generic template class that enforces uniqueness, usually implemented using a red-black tree (so its elements are sorted), that only requires that the datatype it is instantiated with have an operator<. The datatype used may not even have an operator== for std::set to consult if it wanted to. So, in principle, std::set could figure out if any two given elements were equivalent. That is to say, it could determine: if (!(a < b) && !(b < a)) It could not, even in principle given its implicit interface/requirements for datatypes, know for sure that: if (a == b) Despite the fact that equivalency and equality can be expected to coincide the vast majority of the time, they are not quite the same thing. It's perfectly reasonable that it might actually not hold that a and b are equal, even though they are equivalent, for certain datatypes. Certainly not for integers. But, for example, it could make sense to have a string datatype where with a certain locale a certain Hungarian word was equivalent to the same string with a minor variation in diacritical marks or whatever (an entirely culturally defined equivalency), even though it was not equal according to this datatype's own idiosyncratic notion of equality, which is bitwise equality. That would be a datatype that std::set would be entirely capable of rolling with, because it doesn't care about equality. It might be better if our text type had the same semantics as this hypothetical string datatype, because: 1. A culture has gone so far as to make these variations of diacritical marks or whatever equivalent. The Unicode consortium consulted with the Hungarian government, or maybe the Hungarian version of the Académie française, and they asked for this, which seems pretty serious to me. Maybe they'd like to have a unique constraint reject what they consider to be equivalent strings. If this sounds pedantic to you, google "Han unification controversy", because this sounds like that in reverse (Hungarian separation controversy). The fact that when a unique constraint is violated, it isn't actually necessary to check the equality operator function (equivalence will do; no need to call texteq as when selecting with the same value in the qual) suggests that this wouldn't be too hard to facilitate, though again, I admit my understanding here is more shallow than I'd like. Now, I'll grant you that I'm not currently losing any sleep about our cultural insensitivity, and apparently neither are any Hungarians; this just serves to illustrate that my position is The Right Thing (I hope). Here is a description of the Hungarian problem: http://blogs.msdn.com/b/michkap/archive/2005/08/10/449909.aspx It says " the feature is such that since dzs is a compression within Hungarian, that if you see ddzs it should be treated as equivalent to dzsdzs...Basically, the feature is such that since dzs is a compression within Hungarian, that if you see ddzs it should be treated as equivalent to dzsdzs.". We're not the first people to have to deal with this: http://bugs.mysql.com/bug.php?id=12519 Here is Tom's original analysis, that lead to this patch: http://archives.postgresql.org/pgsql-general/2005-12/msg00803.php 2. This way, it's still possible to do the strxfrm() optimisation more or less for free, without breaking hash methods on text. That is not to be sniffed at - sorting text is very common and important. Most people don't know about the C locale, and couldn't use it if they did. > The last section of src/backend/access/nbtree/README has some notes > that you might find relevant. I assume that you're referring to "Notes to Operator Class Implementors". It says: """ An = operator must be an equivalence relation; that is, for all non-null values A,B,C of the datatype: A = A is true reflexive lawif A = B, then B = A symmetric lawif A = B and B = C,then A = C transitive law """ This would still hold; we'd just have to change the = operators here to something representing equivalency, I suppose (or, more likely, don't bring up equivalency .Vs equality here for pedagogical reasons). If it didn't hold, than it would be impossible to write a strxfrm() implementation, since the two variations of the textually distinct though equivalent Hungarian strings would produce identical strxfrm() blobs that were on the one hand bitwise identical, but on the other not reflexive, symmetric or transitive when bitwise compared. This is obviously a logical contradiction. So, we'd still be using a comparison that returns an integer under the hood, but we'd interpret 0 as equivalency and not equality. >> Simple question: if you were to just remove the strcmp tie-breaker for >> () in varstr_cmp(), but not touch anything else, would Postgres >> exhibit objectively incorrect behaviour? > > Yes, it would, and did, before we put that in; see the archives for the > discussions that led up to the patch you mentioned earlier. It wouldn't be the same though, because you also changed the definition of texteq, textne, bpchareq, bpcharne in those commits (to just use strncmp(), not varstr_cmp()), and I am not suggesting that you consider changing them back too. Incidentally, I should mention that I tried removing the strcmp() tie-breaker in varstr_cmp() on master, and unsurprisingly the regression tests don't break. Now, there might be a better way of explaining all of this that is more clear and succinct, and that's what I'll attempt to do: tuplesort.c tells lies to the btree code: That two index tuples that are logically equal are not. Now, the btree code doesn't "repeat" those lies to anyone else, like the user, which is unsurprising because that would lead to incorrect answers. Like all good liars, it is consistent in the lies that it tells (because the lies are derived from the item-pointers of index tuples, which are used as a tie-breaker). I'm tentatively suggesting that it might be to our advantage to tell a similar though different lie to the btree code that it would never have to know about, a lie originating higher-up and reverberating more places, within the comparator proper ( that is, varstr_cmp() ), rather than the index-tuple encapsulating comparator within tuplesort (that is, comparetup_index_btree() ): That two index tuples are equal, when in-fact it might just be that they're equivalent. Now, for most people, this is exactly the same thing since every datatype except text doesn't have a separate notion of equivalence, and text basically only has that notion when using certain locales like hu_HU.UTF-8 (hardly surprising that the regression tests didn't fail on me). The Hungarians get to have uniqueness enforced for textually distinct variants that strcoll() returns 0 for (equivalent not equal values) - in other words, they get what they asked for. However, they still must specify the correct exact variation in predicates, since the equality operator only cares about bitwise equality, which is consistent with general Postgres convention. I believe that that could be the most correct behaviour for them. So, the original complainant saw this behaviour, a clear violation of the trichotomy law: mage=# select 'potyty'::varchar = 'potty'::varchar;?column? ----------f (1 row) mage=# select 'potyty'::varchar < 'potty'::varchar;?column? ----------f (1 row) mage=# select 'potyty'::varchar > 'potty'::varchar;?column? ----------f (1 row) The thing is that I think that the above user-visible behaviour might actually be desirable. 'potyty'::varchar is* not* bitwise equal (our general definition of text equality) to 'potty'::varchar. So, suppose there was an equivalency operator, say ===, all would be right with the world: mage=# select 'potyty'::varchar === 'potty'::varchar;?column? ----------t (1 row) mage=# select 'potyty'::varchar < 'potty'::varchar;?column? ----------f (1 row) mage=# select 'potyty'::varchar > 'potty'::varchar;?column? ----------f (1 row) The reason everything wasn't right with the world, and this example got screwed up for the complainant at the time was that texteq() still returned false via the len1 != len2 fastpath - otherwise Tom's testcase here would not have shown a problem. So at the time and under the exact circumstances, the '=' operator was actually equality-orientated (if only by accident, because len1 != len2 here), whereas varstr_cmp() was entirely equivalency-orientated. So that explains this test-case of Tom's. It does not explain the confusing behaviour that led to the complaint though, since few people are in the habit of verifying that the trichotomy law holds just for the fun of it. online=# select * from common_logins where username = 'potyty'; uid | username | password | lastlogin | status | usertype | loginnum -----+----------+----------+-----------+--------+----------+---------- (0 rows) online=# select * from common_logins where username like 'potyty'; uid | username | password | lastlogin | status | usertype | loginnum --------+----------+----------+----------------------------+--------+----------+---------- 155505 | potyty | board | 2004-08-16 17:45:55.723829 | A | S | 1 60067 | potyty | board | 2004-07-07 20:22:17.68699 | A | S | 3 174041 | potyty | board | 2005-02-17 00:00:13.706144 | A | S | 3 (3 rows) online=# select username, username = 'potyty' from common_logins where username like 'potyty'; username | ?column? ----------+---------- potyty | t potyty | t potyty | t (3 rows) That isn't something that I've been able to figure out yet. I think that maybe an index got corrupted because Incidentally, I should point out that right now this code appears in like_match.c: MatchText(char *t, int tlen, char *p, int plen, pg_locale_t locale, bool locale_is_c) The only way that the locale and locale_is_c are used within this function is via a recursive call to MatchText() - they're entirely vestigial, it seems. Now, again I must admit that it isn't obvious to me that my white lie to the btree code about something being equal that is actually just equivalent won't come back to haunt me - lies have short legs. On the face of it though, it looks like I might get away with it, since equality in a qual is re-checked when I execute a select statement, but isn't checked when I attempt to violate a unique constraint (a behavioural difference which the Hungarians will be happy about, since they took the unusual step of representing that the strings 'potyty' and 'potty' were exactly equivalent to the appropriate authorities). Perhaps more importantly, I cannot recreate any of these problems on my Fedora 16 machine. Even with hu_HU on LATIN2, Tom's original test case (from 2005, on a Fedora 4 machine) cannot be recreated. So it may be that they've tightened these things up in some way. It's far from clear why that should be. It could be worth -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
pgsql-hackers by date: