Thread: Building infrastructure for B-Tree deduplication that recognizes whenopclass equality is also equivalence
Building infrastructure for B-Tree deduplication that recognizes whenopclass equality is also equivalence
From
Peter Geoghegan
Date:
Anastasia's nbtree deduplication patch [1] has an open problem that I would like to find a solution for: it currently assumes that there is no difference between binary equality and opclass equality. That won't work for opclasses such as btree/numeric, because compressing equal numeric datums could destroy display scale if equal numeric datums were naively lumped together (actually, the deduplication patch doesn't work like that, but it has other subtle problems due to not having worked out fundamental definitional issues). We don't need to be able to assume that binary equality is exactly the same thing as opclass equality at the level of individual tuples. We only need to be able to assume that the user cannot observe any differences when they are shown output for two datums that are opclass-equal for any opclass that supports deduplication (i.e. cases like the numeric_ops case just won't work, so we shouldn't event try). I believe that it would be okay if we treated two IndexTuples as equivalent and therefore targets to store together in the same posting list when they happen to have distinct binary representations due to the original datums having different TOAST input state. In short, the deduplication patch cannot tolerate being unable to store opclass-equal IndexTuples in the same posting list when the opclass (or the underlying type being indexed) somehow allows that equality isn't equivalence -- that's simply unsupportable. The opclass gets one chance to say whether or not it vetoes the use of deduplication: at CREATE INDEX time. Consumers of this new infrastructure probably won't be limited to the deduplication feature; the same infrastructure will be needed for a B-Tree prefix compression patch (I'm thinking of a configurable CREATE INDEX prefix compression feature). GIN never had to solve this problem because its indexes are always lossy, and cannot support index-only scans. It seems likely that a scheme like the one I have in mind can work for the vast majority of Postgres B-Tree indexes in practice, so I don't think that the user-visible restrictions I'm considering will make the patch significantly less useful (it's already very useful). The most notable restriction for users will almost certainly be not supporting deduplication within indexes that use nondeterministic collations. They were already paying a performance penalty during hashing, though. I would like to: * Get some buy-in on whether or not the precise distinctions I would like to make are correct for deduplication in particular, and as useful as possible for other cases that we may need to add later on. * Figure out the exact interface through which opclass/opfamily authors can represent that their notion of equality is compatible with deduplication/compression. (I think that the use of nondeterministic collations should disable deduplication without explicit action from the operator class -- that should just be baked in.) * Mark most existing btree operator classes as being compatible with deduplication as part of making the patch committable. As I said, I believe that their semantics are already compatible with what we need for deduplication to work sensibly, aside from a handful of specific exceptions. In any case, I'm certain that problems like the btree/numeric display scale problem are simply not worth solving directly. That would add a huge amount of complexity for very little benefit. [1] https://commitfest.postgresql.org/24/2202/ -- Peter Geoghegan
Re: Building infrastructure for B-Tree deduplication that recognizes when opclass equality is also equivalence
From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes: > We don't need to be able to assume that binary equality is exactly the > same thing as opclass equality at the level of individual tuples. We > only need to be able to assume that the user cannot observe any > differences when they are shown output for two datums that are > opclass-equal for any opclass that supports deduplication (i.e. cases > like the numeric_ops case just won't work, so we shouldn't event try). Hmm, so that would exclude the optimization for numeric, float4/float8, and nondeterministic text collations. Anything else? I agree that teaching opclasses to say whether this is okay is a reasonable approach. > Consumers of this new infrastructure probably won't be limited to the > deduplication feature; Indeed, we run up against this sort of thing all the time in, eg, planner optimizations. I think some sort of "equality is precise" indicator would be really useful for a lot of things. regards, tom lane
Re: Building infrastructure for B-Tree deduplication that recognizeswhen opclass equality is also equivalence
From
Peter Geoghegan
Date:
On Sun, Aug 25, 2019 at 1:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Hmm, so that would exclude the optimization for numeric, float4/float8, > and nondeterministic text collations. Anything else? Any pseudo-type whose output function could possibly be dependent on the output function of another type (in case it happens to be one of the types that definitely aren't safe). Maybe we could make fine distinctions about pseudo-type safety in certain contexts, but that doesn't matter to the deduplication patch. > I agree that teaching opclasses to say whether this is okay is a > reasonable approach. Great. > > Consumers of this new infrastructure probably won't be limited to the > > deduplication feature; > > Indeed, we run up against this sort of thing all the time in, eg, planner > optimizations. I think some sort of "equality is precise" indicator > would be really useful for a lot of things. The case that I happened to think of was "collation strength reduction". In other words, an optimization that has the planner use a merge equijoin whose joinqual involves two text columns using the "C" collation, even though the "C" collation isn't otherwise usable. Perhaps there are far more compelling planner optimization that I haven't considered, though. This idea probably has problems with interesting sort orders that aren't actually that interesting. -- Peter Geoghegan
Re: Building infrastructure for B-Tree deduplication that recognizeswhen opclass equality is also equivalence
From
Peter Geoghegan
Date:
On Sun, Aug 25, 2019 at 2:18 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Indeed, we run up against this sort of thing all the time in, eg, planner > > optimizations. I think some sort of "equality is precise" indicator > > would be really useful for a lot of things. > > The case that I happened to think of was "collation strength > reduction". I was thinking of stashing an "equality is precise" flag in the metapage of each nbtree index, since we will only determine this once, at CREATE INDEX time. That would make it fairly natural for the planner to ask about the "equality is precise"-ness of the index at the same point that it calls _bt_getrootheight(): within get_relation_info(). -- Peter Geoghegan
Re: Building infrastructure for B-Tree deduplication that recognizes when opclass equality is also equivalence
From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes: > I was thinking of stashing an "equality is precise" flag in the > metapage of each nbtree index, since we will only determine this once, > at CREATE INDEX time. Sure. > That would make it fairly natural for the > planner to ask about the "equality is precise"-ness of the index at > the same point that it calls _bt_getrootheight(): within > get_relation_info(). The planner will almost certainly want to ask the opclass directly, because most of the places where it wants to know this sort of thing about operator behavior have nothing to do with indexes. regards, tom lane
Re: Building infrastructure for B-Tree deduplication that recognizeswhen opclass equality is also equivalence
From
Peter Geoghegan
Date:
On Sun, Aug 25, 2019 at 2:40 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I was thinking of stashing an "equality is precise" flag in the > > metapage of each nbtree index, since we will only determine this once, > > at CREATE INDEX time. > > Sure. I suppose that we'd add something new to CREATE OPERATOR CLASS to make this work? My instinct is to avoid adding things that are only meaningful for a single AM to interfaces like CREATE OPERATOR CLASS, but the system already has numerous dependencies on B-Tree opclasses that seem comparable to me. There is a single case where nbtree stores a type that differs from the type actually being indexed by the operator class: the "name" case, where the underlying storage type is actually cstring. I'm not sure whether or not this needs to be treated as its own kind of special case. I suppose that we can ignore it completely, because we're not directly concerned with the physical representation used within an index. In fact, a major goal for this new infrastructure is that nbtree gets to fully own the representation (it just needs to know about the high level or logical requirements). -- Peter Geoghegan
Re: Building infrastructure for B-Tree deduplication that recognizeswhen opclass equality is also equivalence
From
Peter Geoghegan
Date:
On Sun, Aug 25, 2019 at 2:55 PM Peter Geoghegan <pg@bowt.ie> wrote: > I suppose that we'd add something new to CREATE OPERATOR CLASS to make > this work? My instinct is to avoid adding things that are only > meaningful for a single AM to interfaces like CREATE OPERATOR CLASS, > but the system already has numerous dependencies on B-Tree opclasses > that seem comparable to me. Another question is whether or not it would be okay to define "equality is precise"-ness to be "the system's generic equality function works perfectly as a drop-in replacement for my own equality operator's function". The system's generic equality function could be the recently added datum_image_eq() function -- that looks like it will do exactly what I have in mind. This would be a new way of using datum_image_eq(), I think, since it wouldn't be okay for it to give an answer that differed from the equality operator's function. It looks like existing datum_image_eq() callers can deal with false negatives (but not false positives, which are impossible). This exceeds what is strictly necessary for the deduplication patch, but it seems like the patch should make comparisons as fast as possible in the context of deduplicating items (it would be nice if it could just use datum_image_eq instead of an insertion scankey when doing many comparisons to deduplicate items). We can imagine a datatype with undefined garbage bytes that affect the answer that datum_image_eq() gives, but could be safe targets for deduplication, so it's not clear if being this aggressive will work. But maybe that isn't actually possible among types that aren't inherently unsafe for deduplication. And maybe we could be more aggressive with optimizations in numerous other contexts by defining "equality is precise"-ness as strict binary equality after accounting for TOAST compression. -- Peter Geoghegan
Re: Building infrastructure for B-Tree deduplication that recognizes when opclass equality is also equivalence
From
Antonin Houska
Date:
Peter Geoghegan <pg@bowt.ie> wrote: > Consumers of this new infrastructure probably won't be limited to the > deduplication feature; It'd also solve an open problem of the aggregate push-down patch [1], in particular see the mention of pg_opclass in [2]: the partial aggregate node below the final join must not put multiple opclass-equal values of which are not byte-wise equal into the same group because some information needed by WHERE or JOIN/ON condition may be lost this way. The scale of the numeric type is the most obvious example. > I would like to: > > * Get some buy-in on whether or not the precise distinctions I would > like to make are correct for deduplication in particular, and as > useful as possible for other cases that we may need to add later on. > > * Figure out the exact interface through which opclass/opfamily > authors can represent that their notion of equality is compatible with > deduplication/compression. It's not entirely clear to me whether opclass or opfamily should carry this information. opclass probably makes more sense for index related problems and the aggregate push-down patch can live with that. I don't see particular reason to add any flag to opfamily. (Planner uses uses both pg_opclass and pg_opfamily catalogs.) I think the fact that the aggregate push-down would benefit from this enhancement should affect choice of the new catalog attribute name, i.e. it should be not mention words as concrete as "deduplication" or "compression". > (I think that the use of nondeterministic collations should disable > deduplication without explicit action from the operator class -- that > should just be baked in.) (I think the aggregate push-down needs to consider the nondeterministic collations too, I missed that so far.) [1] https://commitfest.postgresql.org/24/1247/ [2] https://www.postgresql.org/message-id/10529.1547561178%40localhost -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Building infrastructure for B-Tree deduplication that recognizeswhen opclass equality is also equivalence
From
Anastasia Lubennikova
Date:
26.08.2019 14:15, Antonin Houska wrote: > Peter Geoghegan <pg@bowt.ie> wrote: > >> Consumers of this new infrastructure probably won't be limited to the >> deduplication feature; > It'd also solve an open problem of the aggregate push-down patch [1], in > particular see the mention of pg_opclass in [2]: the partial aggregate > node below the final join must not put multiple opclass-equal values of > which are not byte-wise equal into the same group because some > information needed by WHERE or JOIN/ON condition may be lost this > way. The scale of the numeric type is the most obvious example. > >> I would like to: >> >> * Get some buy-in on whether or not the precise distinctions I would >> like to make are correct for deduplication in particular, and as >> useful as possible for other cases that we may need to add later on. >> >> * Figure out the exact interface through which opclass/opfamily >> authors can represent that their notion of equality is compatible with >> deduplication/compression. > It's not entirely clear to me whether opclass or opfamily should carry > this information. opclass probably makes more sense for index related > problems and the aggregate push-down patch can live with that. I don't > see particular reason to add any flag to opfamily. (Planner uses uses > both pg_opclass and pg_opfamily catalogs.) > > I think the fact that the aggregate push-down would benefit from this > enhancement should affect choice of the new catalog attribute name, > i.e. it should be not mention words as concrete as "deduplication" or > "compression". The patch implementing new opclass option is attached. It adds new attribute pg_opclass.opcisbitwise, which is set to true if opclass equality is the same as binary equality. By default it is true. It is set to false for numeric and float4, float8. Does anyarray opclasses need special treatment? New syntax for create opclass is "CREATE OPERATOR CLASS NOT BITWISE ..." Any ideas on better names? -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
Re: Building infrastructure for B-Tree deduplication that recognizes when opclass equality is also equivalence
From
Antonin Houska
Date:
Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > The patch implementing new opclass option is attached. > > It adds new attribute pg_opclass.opcisbitwise, which is set to true if opclass > equality is the same as binary equality. > By default it is true. I think the default value should be false and we should only set it to true for individual opclasses which do meet the bitwise equality requirement. Also extension authors should explicitly state that their data types are bitwise equal. Otherwise the existing opclasses, when created via pg_dump -> pg_restore, can be used by the system incorrectly. > It is set to false for numeric and float4, float8. Are you sure about these? -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Building infrastructure for B-Tree deduplication that recognizes when opclass equality is also equivalence
From
Andrew Gierth
Date:
>>>>> "Antonin" == Antonin Houska <ah@cybertec.at> writes: >> It is set to false for numeric and float4, float8. Antonin> Are you sure about these? numeric values can compare equal but have different display scales (see hash_numeric). float4 and float8 both have representations for -0, which compares equal to 0. (numeric technically has a representation for -0 too, but I believe the current code carefully avoids ever generating it.) -- Andrew (irc:RhodiumToad)
Re: Building infrastructure for B-Tree deduplication that recognizeswhen opclass equality is also equivalence
From
Anastasia Lubennikova
Date:
01.10.2019 8:41, Antonin Houska wrote: > Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > >> The patch implementing new opclass option is attached. >> >> It adds new attribute pg_opclass.opcisbitwise, which is set to true if opclass >> equality is the same as binary equality. >> By default it is true. > I think the default value should be false and we should only set it to true > for individual opclasses which do meet the bitwise equality requirement. Also > extension authors should explicitly state that their data types are bitwise > equal. Otherwise the existing opclasses, when created via pg_dump -> > pg_restore, can be used by the system incorrectly. Thank you for the feedback. At first I implemented bitwise as default, because it is more common . Though, I agree that it's essential to avoid false positives here. The new version of the patch is attached. I also updated pg_dump. A few more open questions: 1) How to handle contrib modules that create new opclasses? Since default is 'not bitwise' it means that various operator classes created in extensions such as bloom, btree_gin and others, won't be able to take advantage of various optimizations that require the opclass to be BITWISE. 'v2-Opclass-bitwise-equality-0002' patch simply adds BITWISE keyword where necessary. 2) Whether we should provide ALTER OPERATOR CLASS SET BITWISE syntax? 3) Current patch modifies regression test so that it checks CREATE OPCLASS BITWISE syntax. Is there anything else worth testing? As I see it, this patch is just about infrastructure changes, and more specific tests will be added by features that will implement further optimizations. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
Re: Building infrastructure for B-Tree deduplication that recognizeswhen opclass equality is also equivalence
From
Peter Geoghegan
Date:
On Mon, Oct 28, 2019 at 11:11 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > At first I implemented bitwise as default, because it is more common . > Though, I agree that it's essential to avoid false positives here. > The new version of the patch is attached. I also updated pg_dump. > > A few more open questions: > 1) How to handle contrib modules that create new opclasses? > Since default is 'not bitwise' it means that various operator classes > created in extensions > such as bloom, btree_gin and others, won't be able to take advantage of > various optimizations > that require the opclass to be BITWISE. What optimizations? Do we anticipate that other index AMs will benefit from BITWISE-ness? > 'v2-Opclass-bitwise-equality-0002' patch simply adds BITWISE keyword > where necessary. > > 2) Whether we should provide ALTER OPERATOR CLASS SET BITWISE syntax? I think that that's probably not desirable. There should at least be a strong practical advantage if we go that way. This would mean ALTER OPERATOR CLASS could change the "substance" of an opclass, which is fundamentally different from what it can do already (it currently just changes the owner, or the schema that it is stored in). > 3) Current patch modifies regression test so that it checks CREATE > OPCLASS BITWISE syntax. > Is there anything else worth testing? As I see it, this patch is just > about infrastructure changes, > and more specific tests will be added by features that will implement > further optimizations. I think so too -- this is really about associating a single piece of information with an operator class. BTW: No need to bump catversion when posting a patch, which I see in "v2-Opclass-*0001.patch". That is our policy. (A catversion bump is generally supposed to be done at the last minute, just as the patch is committed. This avoids unnecessary conflicts against the master branch over time, as a patch is developed.) -- Peter Geoghegan
Re: Building infrastructure for B-Tree deduplication that recognizeswhen opclass equality is also equivalence
From
Anastasia Lubennikova
Date:
14.11.2019 0:25, Peter Geoghegan wrote: > On Mon, Oct 28, 2019 at 11:11 AM Anastasia Lubennikova > <a.lubennikova@postgrespro.ru> wrote: >> At first I implemented bitwise as default, because it is more common . >> Though, I agree that it's essential to avoid false positives here. >> The new version of the patch is attached. I also updated pg_dump. >> >> A few more open questions: >> 1) How to handle contrib modules that create new opclasses? >> Since default is 'not bitwise' it means that various operator classes >> created in extensions >> such as bloom, btree_gin and others, won't be able to take advantage of >> various optimizations >> that require the opclass to be BITWISE. > What optimizations? Do we anticipate that other index AMs will benefit > from BITWISE-ness? I was thinking of possible planner optimizations, that Tom mentioned up thread. Though, I don't have any specific examples. Anyway, we can implement support for user-defined opclasses later. >> 3) Current patch modifies regression test so that it checks CREATE >> OPCLASS BITWISE syntax. >> Is there anything else worth testing? As I see it, this patch is just >> about infrastructure changes, >> and more specific tests will be added by features that will implement >> further optimizations. > I think so too -- this is really about associating a single piece of > information with an operator class. Great. It seems that the patch is ready for commit. I attached new version with pg_opclass documentation update. One more thing I am uncertain about is array_ops. Arrays may contain bitwise and not bitwise element types. What is the correct value of opcisbitwise the array_ops itself? -- Anastasia Lubennikova Postgres Professional:http://www.postgrespro.com The Russian Postgres Company
Attachment
Re: Building infrastructure for B-Tree deduplication that recognizes when opclass equality is also equivalence
From
Antonin Houska
Date:
Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > I attached new version with pg_opclass documentation update. > > One more thing I am uncertain about is array_ops. Arrays may contain bitwise > and not bitwise element types. > What is the correct value of opcisbitwise the array_ops itself? How about setting opcisbitwise to false for the array_ops opclass and checking opcisbitwise of the element type whenever we need to know whether the array is "bitwise equal"? When checking array_eq(), I thought whether the existence of "expanded array" format is a problem but it does not seem to be: the conversion of "expanded" value to "flat" value and then back to the "expanded" should not change the array contents. Anyway, in the current version of the patch I see that array_ops opclasses have opcisbitwise=true. It should be false even if you don't use the approach of checking the element type. Besides that, I think that record_ops is similar to array_ops and therefore it should not set opcisbitwise to true. I also remember that, when thinking about the problem in the context of the aggregate push down patch, I considered some of the geometric types problematic. For example, box_eq() uses this expression #define FPeq(A,B) (fabs((A) - (B)) <= EPSILON) so equality does not imply bitwise equality here. Maybe you should only set the flag for btree opclasses for now. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Building infrastructure for B-Tree deduplication that recognizeswhen opclass equality is also equivalence
From
Robert Haas
Date:
On Wed, Nov 13, 2019 at 4:25 PM Peter Geoghegan <pg@bowt.ie> wrote: > I think that that's probably not desirable. There should at least be a > strong practical advantage if we go that way. This would mean ALTER > OPERATOR CLASS could change the "substance" of an opclass, which is > fundamentally different from what it can do already (it currently just > changes the owner, or the schema that it is stored in). My impression is that this is more of an implementation restriction than a design goal. I don't really remember the details, but it seems to me that there were locking and/or cache invalidation problems with making ALTER OPERATOR CLASS do more substantive things -- and that it was because of those problems, not a lack of desire, that we didn't support it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Building infrastructure for B-Tree deduplication that recognizeswhen opclass equality is also equivalence
From
Peter Geoghegan
Date:
On Thu, Dec 19, 2019 at 12:05 PM Robert Haas <robertmhaas@gmail.com> wrote: > My impression is that this is more of an implementation restriction > than a design goal. I don't really remember the details, but it seems > to me that there were locking and/or cache invalidation problems with > making ALTER OPERATOR CLASS do more substantive things -- and that it > was because of those problems, not a lack of desire, that we didn't > support it. I agree with you. My point was only that this is something that the operator class author is really expected to get right the first time around -- just like the behavior of B-Tree support function 1. We're really only concerned about the upgrade path for external types that could see a benefit from the optimization planned for nbtree (and possibly other such optimization). Providing a non-disruptive way to get that benefit after a pg_upgrade only seems like a nice-to-have to me, because it's not as if anything will stop working as well as it once did. Also, there aren't that many external types that will be made more useful by being able to use optimizations like deduplication; in practice almost all B-Tree indexes only use a small handful of operator classes that are shipped in core Postgres. Once you're using common types like text and integer, a pg_upgrade'd database is only a REINDEX away from being able to use deduplication (though I am not even sure if even that will be necessary in the final patch; I hope to be able to avoid even that inconvenience with indexes using core operator classes). If the underlying behavior of an operator class actually changes, then that's a disaster for all the usual reasons. It doesn't make that much sense to reverse an earlier decision to make an operator class BITWISE. Better to drop everything, and recreate everything, since your indexes should be considered corrupt anyway. (Also, I don't think that it's that hard to get it right, so this will probably never happen.) -- Peter Geoghegan
Re: Building infrastructure for B-Tree deduplication that recognizeswhen opclass equality is also equivalence
From
Anastasia Lubennikova
Date:
19.12.2019 18:19, Antonin Houska wrote: > Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > >> I attached new version with pg_opclass documentation update. >> >> One more thing I am uncertain about is array_ops. Arrays may contain bitwise >> and not bitwise element types. >> What is the correct value of opcisbitwise the array_ops itself? > How about setting opcisbitwise to false for the array_ops opclass and checking > opcisbitwise of the element type whenever we need to know whether the array is > "bitwise equal"? When checking array_eq(), I thought whether the existence of > "expanded array" format is a problem but it does not seem to be: the > conversion of "expanded" value to "flat" value and then back to the "expanded" > should not change the array contents. > > Anyway, in the current version of the patch I see that array_ops opclasses > have opcisbitwise=true. It should be false even if you don't use the approach > of checking the element type. > > Besides that, I think that record_ops is similar to array_ops and therefore it > should not set opcisbitwise to true. > > I also remember that, when thinking about the problem in the context of the > aggregate push down patch, I considered some of the geometric types > problematic. For example, box_eq() uses this expression > > #define FPeq(A,B) (fabs((A) - (B)) <= EPSILON) > > so equality does not imply bitwise equality here. Maybe you should only set > the flag for btree opclasses for now. Thank you for pointing out at the issue with geometric opclasses. If I understand it correctly, regular float types are not bitwise as well. I updated the patchset. The first patch now contains only infrastructure changes and the second one sets opcisbitwise for btree opclasses in pg_opclass.dat. I've tried to be conservative and only mark types that are 100% bitwise safe. See attached v2-Opclass-isbitwise.out file. Non-atomic types, such as record, range, json and enum depend on element types. Text can be considered bitwise (i.e. texteq uses memcmp) only when specific collation clauses are satisfied. We can make this 'opcisbitwise' parameter enum (or char) instead of boolean to mark "always bitwise", "never bitwise" and "maybe bitwise". Though, I doubt if it will be helpful in any real use case. What do you think? -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
Re: Building infrastructure for B-Tree deduplication that recognizeswhen opclass equality is also equivalence
From
Alvaro Herrera
Date:
> @@ -106,6 +106,18 @@ CREATE OPERATOR CLASS <replaceable class="parameter">name</replaceable> [ DEFAUL > </listitem> > </varlistentry> > > + <varlistentry> > + <term><literal>NOT BITWISE</literal></term> > + <listitem> > + <para> > + If present, the operator class equality is not the same as equivalence. > + For example, two numerics can compare equal but have different scales. > + Most opclasses implement bitwise equal comparison, alternative behaviour > + must be set explicitly. > + </para> > + </listitem> > + </varlistentry> Am I the only one bothered by the fact that this patch (and all downstream discussion) reduces the term "bitwise equality" to simply "bitwise"? It reads really strange to me, both in the resulting SQL grammar as well as in struct names, code comments etc. "This operator class is bitwise." -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Building infrastructure for B-Tree deduplication that recognizes when opclass equality is also equivalence
From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > Am I the only one bothered by the fact that this patch (and all > downstream discussion) reduces the term "bitwise equality" to simply > "bitwise"? It reads really strange to me, both in the resulting SQL > grammar as well as in struct names, code comments etc. "This operator > class is bitwise." I agree, that's really poor English. regards, tom lane
Re: Building infrastructure for B-Tree deduplication that recognizeswhen opclass equality is also equivalence
From
Anastasia Lubennikova
Date:
24.12.2019 19:08, Alvaro Herrera wrote: >> @@ -106,6 +106,18 @@ CREATE OPERATOR CLASS <replaceable class="parameter">name</replaceable> [ DEFAUL >> </listitem> >> </varlistentry> >> >> + <varlistentry> >> + <term><literal>NOT BITWISE</literal></term> >> + <listitem> >> + <para> >> + If present, the operator class equality is not the same as equivalence. >> + For example, two numerics can compare equal but have different scales. >> + Most opclasses implement bitwise equal comparison, alternative behaviour >> + must be set explicitly. >> + </para> >> + </listitem> >> + </varlistentry> > Am I the only one bothered by the fact that this patch (and all > downstream discussion) reduces the term "bitwise equality" to simply > "bitwise"? It reads really strange to me, both in the resulting SQL > grammar as well as in struct names, code comments etc. "This operator > class is bitwise." > Thank you for pointing that out. Do you have any suggestions on how to name it better? Should it rather be "CREATE OPERATOR CLASS ... BITWISE EQUAL" ? In the recent version of the patch I also had a question, if it will be useful to do this option enum instead of boolean: > We can make this 'opcisbitwise' parameter enum (or char) instead of > boolean to mark > "always bitwise", "never bitwise" and "maybe bitwise". This decision will also affect the syntax. So I'd rather agree on that before updating syntax. Do you have an opinion on that? -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Building infrastructure for B-Tree deduplication that recognizeswhen opclass equality is also equivalence
From
Robert Haas
Date:
On Tue, Dec 24, 2019 at 7:29 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > We can make this 'opcisbitwise' parameter enum (or char) instead of > boolean to mark > "always bitwise", "never bitwise" and "maybe bitwise". Though, I doubt > if it will be helpful in any real use case. What would be the difference between "never bitwise" and "maybe bitwise" in that scheme? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Building infrastructure for B-Tree deduplication that recognizeswhen opclass equality is also equivalence
From
Anastasia Lubennikova
Date:
29.12.2019 2:56, Robert Haas wrote: > On Tue, Dec 24, 2019 at 7:29 AM Anastasia Lubennikova > <a.lubennikova@postgrespro.ru> wrote: >> We can make this 'opcisbitwise' parameter enum (or char) instead of >> boolean to mark >> "always bitwise", "never bitwise" and "maybe bitwise". Though, I doubt >> if it will be helpful in any real use case. > What would be the difference between "never bitwise" and "maybe > bitwise" in that scheme? In this design "maybe" category reflects the need for an extra recheck. For example, float and numeric types are "never bitwise equal", while array, text, and other container types are "maybe bitwise equal". An array of integers or text with C collation can be treated as bitwise equal attributes, and it would be too harsh to restrict them from deduplication. What bothers me is that this option will unlikely be helpful on its own and we should also provide some kind of recheck function along with opclass, which complicates this idea even further and doesn't seem very clear. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Building infrastructure for B-Tree deduplication that recognizeswhen opclass equality is also equivalence
From
Robert Haas
Date:
On Mon, Dec 30, 2019 at 10:57 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > In this design "maybe" category reflects the need for an extra recheck. > > For example, float and numeric types are "never bitwise equal", while array, > text, and other container types are "maybe bitwise equal". An array of > integers > or text with C collation can be treated as bitwise equal attributes, and it > would be too harsh to restrict them from deduplication. > > What bothers me is that this option will unlikely be helpful on its own > and we > should also provide some kind of recheck function along with opclass, which > complicates this idea even further and doesn't seem very clear. It seems like the simplest thing might be to forget about the 'char' column and just have a support function which can be used to assess whether a given opclass's notion of equality is bitwise. If equality is always bitwise, the function can always return true. If it's sometimes bitwise, it can return true or false as appropriate. If it's never bitwise, then it can either always return false or the support function can be omitted altogether (so that the safe value is the default). I don't think you're going to save very much by avoiding an indirect function call in the "always" case. It doesn't really seem worth the complexity of making that a special case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Building infrastructure for B-Tree deduplication that recognizeswhen opclass equality is also equivalence
From
Peter Geoghegan
Date:
On Mon, Dec 30, 2019 at 9:45 AM Robert Haas <robertmhaas@gmail.com> wrote: > > For example, float and numeric types are "never bitwise equal", while array, > > text, and other container types are "maybe bitwise equal". An array of > > integers > > or text with C collation can be treated as bitwise equal attributes, and it > > would be too harsh to restrict them from deduplication. We might as well support container types (like array) in the first Postgres version that has nbtree deduplication, I suppose. Even still, I don't think that it actually matters much to users. B-Tree indexes on arrays are probably very rare. Note that I don't consider text to be a container type here -- obviously btree/text_ops is a very important opclass for the deduplication feature. It may be the most important opclass overall. Recursively invoking a support function for the "contained" data type in the btree/array_ops support function seems like it might be messy. Not sure about that, though. > > What bothers me is that this option will unlikely be helpful on its own > > and we > > should also provide some kind of recheck function along with opclass, which > > complicates this idea even further and doesn't seem very clear. > > It seems like the simplest thing might be to forget about the 'char' > column and just have a support function which can be used to assess > whether a given opclass's notion of equality is bitwise. I like the idea of relying only on a support function. This approach makes collations a problem that the opclass author has to deal with directly, as is the case within a SortSupport support function. Also seems like it would make life easier for third party data types that want to make use of these optimizations (if in fact there are any). I also see little downside to this approach. The extra cycles shouldn't be noticeable. As far as the B-Tree deduplication logic is concerned, the final boolean value (is deduplication safe?) comes from the index metapage -- we pass that down through an insertion scankey. We only need to determine whether or not the optimization is safe at CREATE INDEX time. (Actually, I don't want to commit to the idea that nbtree should only call this support function at CREATE INDEX time right now. I'm sure that it will hardly ever need to be called, though.) -- Peter Geoghegan
Re: Building infrastructure for B-Tree deduplication that recognizeswhen opclass equality is also equivalence
From
Peter Geoghegan
Date:
On Tue, Dec 24, 2019 at 4:29 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > I updated the patchset. > The first patch now contains only infrastructure changes > and the second one sets opcisbitwise for btree opclasses in pg_opclass.dat. We should try to formally define what we're trying to represent about B-Tree opclasses here -- the definition of "opcisbitwise"/preciseness/whatever should be tightened up. In particular, it should be clear how the "image" binary row comparators [1] (i.e. "operator *= equality" stuff) fit in. This new concept should be defined in terms of that existing concept -- we're talking about exactly the same variety of "internal binary equality" here, I think. I propose that we adopt the following definition: For an operator class to be safe, its equality operator has to always agree with datum_image_eq() (i.e. two datums must be bitwise equal after detoasting). (Maybe we should say something about "operator *= equality" as well (or instead), since that is already documented in [1].) We may also want to say something about foreign keys in this formal definition of "opcisbitwise"/preciseness. Discussion around the bug fixed by commit 1ffa59a85cb [1] showed that there was plenty of confusion in this area. Commit 1ffa59a85cb simply solved the problem that existed with foreign keys, without "joining the dots". It reused the rowtypes.c "operator *= equality" stuff to fix the problem, but only in an ad-hoc and undoumented way. Let's not do that again now. Note: In theory this definition is stricter than truly necessary to make deduplication safe, because we can imagine a contrived case in which an operator class exists where datum_image_eq() does not always agree with the equality operator, even though the equality operator will reliably consider two datums to be equal only when they have identical outputs from the underlying type's output function. This could happen when an operator class author wasn't very careful about zeroing padding -- this may not have mattered to the opclass author because nobody relied on that padding anyway. I think that stuff like this is not worth worrying about -- it can only happen because the datatype/operator class author was very sloppy. Note also: We seem to make this assumption already. Maybe this uninitialized bytes side issue doesn't even need to be pointed out or discussed. The comment just above VALGRIND_CHECK_MEM_IS_DEFINED() within PageAddItemExtended() seems to suggest this. The comment specifically mentions datumIsEqual() (not datum_image_eq()), but it's exactly the same issue. [1] https://www.postgresql.org/docs/devel/functions-comparisons.html#COMPOSITE-TYPE-COMPARISON [2] https://www.postgresql.org/message-id/flat/3326fc2e-bc02-d4c5-e3e5-e54da466e89a%402ndquadrant.com -- Peter Geoghegan
Re: Building infrastructure for B-Tree deduplication that recognizeswhen opclass equality is also equivalence
From
Robert Haas
Date:
On Mon, Dec 30, 2019 at 6:58 PM Peter Geoghegan <pg@bowt.ie> wrote: > I propose that we adopt the following definition: For an operator > class to be safe, its equality operator has to always agree with > datum_image_eq() (i.e. two datums must be bitwise equal after > detoasting). I suggested using datumIsEqual() as the canonical definition. (I wonder why datum_image_eq() does not reuse that function?) > Note: In theory this definition is stricter than truly necessary to > make deduplication safe, because we can imagine a contrived case in > which an operator class exists where datum_image_eq() does not always > agree with the equality operator, even though the equality operator > will reliably consider two datums to be equal only when they have > identical outputs from the underlying type's output function. This > could happen when an operator class author wasn't very careful about > zeroing padding -- this may not have mattered to the opclass author > because nobody relied on that padding anyway. I think that stuff like > this is not worth worrying about -- it can only happen because the > datatype/operator class author was very sloppy. +1. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Building infrastructure for B-Tree deduplication that recognizeswhen opclass equality is also equivalence
From
Peter Geoghegan
Date:
On Thu, Jan 2, 2020 at 6:42 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Dec 30, 2019 at 6:58 PM Peter Geoghegan <pg@bowt.ie> wrote: > > I propose that we adopt the following definition: For an operator > > class to be safe, its equality operator has to always agree with > > datum_image_eq() (i.e. two datums must be bitwise equal after > > detoasting). > > I suggested using datumIsEqual() as the canonical definition. (I > wonder why datum_image_eq() does not reuse that function?) The difference between datum_image_eq() and datumIsEqual() is that only the former will consider two datums equal when they happen to have different TOAST input states -- we need that here. datumIsEqual() avoids doing this because sometimes it needs to work for callers operating within an aborted transaction. datum_image_eq() was originally used for the "*=, *<>, *<, *<=, *>, and *>=" rowtype B-Tree operator class needed by REFRESH MATERIALIZED VIEW CONCURRENTLY. (Actually, that's not quite true, since datum_image_eq() is a spin-off of the rowtype code that was added much more recently to fix a bug in foreign keys.) The B-Tree code and amcheck need to be tolerant of inconsistent TOAST input states. This isn't particularly likely to happen, but it would be hard to revoke the general assumption that that's okay now. Also, it's not that hard to deal with it directly. For example, we're not reliant on equal index tuples all being the same size in the deduplication patch. -- Peter Geoghegan
Re: Building infrastructure for B-Tree deduplication that recognizeswhen opclass equality is also equivalence
From
Robert Haas
Date:
On Thu, Jan 2, 2020 at 12:11 PM Peter Geoghegan <pg@bowt.ie> wrote: > The difference between datum_image_eq() and datumIsEqual() is that > only the former will consider two datums equal when they happen to > have different TOAST input states -- we need that here. Ah, OK. Sorry for the noise. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Building infrastructure for B-Tree deduplication that recognizeswhen opclass equality is also equivalence
From
Anastasia Lubennikova
Date:
On 31.12.2019 01:40, Peter Geoghegan wrote: > On Mon, Dec 30, 2019 at 9:45 AM Robert Haas <robertmhaas@gmail.com> wrote: >>> For example, float and numeric types are "never bitwise equal", while array, >>> text, and other container types are "maybe bitwise equal". An array of >>> integers >>> or text with C collation can be treated as bitwise equal attributes, and it >>> would be too harsh to restrict them from deduplication. > We might as well support container types (like array) in the first > Postgres version that has nbtree deduplication, I suppose. Even still, > I don't think that it actually matters much to users. B-Tree indexes > on arrays are probably very rare. Note that I don't consider text to > be a container type here -- obviously btree/text_ops is a very > important opclass for the deduplication feature. It may be the most > important opclass overall. > > Recursively invoking a support function for the "contained" data type > in the btree/array_ops support function seems like it might be messy. > Not sure about that, though. > >>> What bothers me is that this option will unlikely be helpful on its own >>> and we >>> should also provide some kind of recheck function along with opclass, which >>> complicates this idea even further and doesn't seem very clear. >> It seems like the simplest thing might be to forget about the 'char' >> column and just have a support function which can be used to assess >> whether a given opclass's notion of equality is bitwise. > I like the idea of relying only on a support function. In attachment you can find the WIP patch that adds support function for btree opclasses. Before continuing, I want to ensure that I understood the discussion above correctly. Current version of the patch adds: 1) new syntax, which allow to provide support function: CREATE OPERATOR CLASS int4_ops_test FOR TYPE int4 USING btree AS OPERATOR 1 =(int4, int4), FUNCTION 1 btint4cmp(int4, int4), SUPPORT datum_image_eqisbitwise; We probably can add more words to specify the purpose of the support function. Do you have any other objections about the place of this new element in CreateOplcass syntax structure? 2) trivial support function that always returns true 'datum_image_eqisbitwise'. It is named after 'datum_image_eq', because we define this support function via its behavior. If this prototype is fine, I will continue this work and add support functions for other opclasses, update pg_dump and documentation. Thoughts?
Attachment
Re: Building infrastructure for B-Tree deduplication that recognizeswhen opclass equality is also equivalence
From
Peter Geoghegan
Date:
On Mon, Jan 13, 2020 at 12:49 PM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > In attachment you can find the WIP patch that adds support function for > btree opclasses. Cool. Thanks! > Current version of the patch adds: > > 1) new syntax, which allow to provide support function: > > CREATE OPERATOR CLASS int4_ops_test > FOR TYPE int4 USING btree AS > OPERATOR 1 =(int4, int4), > FUNCTION 1 btint4cmp(int4, int4), > SUPPORT datum_image_eqisbitwise; Hmm. Do we really need these grammar changes? If so, why? I think that you wanted to make this something that could work with any opclass of any index access method, but I don't see that as a useful goal. (If it was useful, it could be considered later, on a case by case basis.) I imagined that this infrastructure would consist of inventing a new variety of B-Tree opclass support function -- something like sortsupport. You could generalize from the example of commit c6e3ac11b60, which added SortSupport functions (also known as "B-Tree support function 2" functions). You might also take a look at the much more recent commit 0a459cec96d, which added in_range functions (also known as "B-Tree support function 3"). Note that neither of those two commits had grammar changes for CREATE OPERATOR CLASS, or anything like that. What I have in mind is a "B-Tree support function 4", obviously. You should probably add a C function that's similar to PrepareSortSupportFromIndexRel()/FinishSortSupportFunction() that will be called from the B-Tree code. This will give a simple yes/no answer to the question: "Is it safe to apply deduplication to this Relation"? This C function will know to return false for an opclass that doesn't have any support function 4 set for any single attribute. It can provide a general overview of what we're telling the caller about the opclass here, etc. Another patch could add a similar function that works with a plain operator, a bit like PrepareSortSupportFromOrderingOp() -- but that isn't necessary now. I suppose that this approach requires something a bit like a struct SortSupportData, with filled-out collation information, etc. The nbtree code expects a simple yes/no answer based on all columns in the index, so it will be necessary to serialize that information to send it across the SQL function interface -- the pg_proc support function will have one argument of type "internal". And, I suppose that you'll also need some basic btvalidate() validation code. > We probably can add more words to specify the purpose of the support > function. Right -- some documentation is needed in btree.sgml, alongside the existing stuff for support functions 1, 2, and 3. > 2) trivial support function that always returns true > 'datum_image_eqisbitwise'. > It is named after 'datum_image_eq', because we define this support > function via its behavior. I like the idea of a generic, trivial SQL-callable function that all simple scalar types can use -- one that just returns true. Maybe we should call this general class of function an "image_equal" function, and refer to "image equality" in the btree.sgml docs. I don't think that using the term "bitwise" is helpful, since it sounds very precise but is actually slightly inaccurate (since TOASTable datums can be "image equal", but not bitwise equal according to datumIsEqual()). How does everyone feel about "image equality" as a name? As I said before, it seems like a good idea to tie this new infrastructure to existing infrastructure used for things like REFRESH MATERIALIZED VIEW CONCURRENTLY. > If this prototype is fine, I will continue this work and add support > functions for other opclasses, update pg_dump and documentation. If this work is structured as a new support function, then it isn't really a user-visible feature -- you won't need pg_dump support, psql support, etc. Most of the documentation will be for operator class authors rather than regular users. We can document the specific opclasses that have support for deduplication later, if at all. I think it would be fine if the deduplication docs (not the docs for this infrastructure) just pointed out specific cases that we *cannot* support -- there are not many exceptions (numeric, text with a nondeterministic collation, a few others like that). -- Peter Geoghegan
Re: Building infrastructure for B-Tree deduplication that recognizeswhen opclass equality is also equivalence
From
Peter Geoghegan
Date:
On Sun, Aug 25, 2019 at 1:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I agree that teaching opclasses to say whether this is okay is a > reasonable approach. I've begun working on this, with help from Anastasia. My working assumption is that I only need to care about opclass-declared input data types (pg_opclass.opcintype), plus the corresponding collations -- the former can be used to lookup an appropriate pg_amproc entry (i.e. B-Tree support function 4), while the latter are passed to the support function to get an answer about whether or not it's okay to use deduplication. This approach seems to be good enough as far as the deduplication project's needs are concerned. However, I think that I probably need to take a broader view of the problem than that. Any guidance would be much appreciated. > > Consumers of this new infrastructure probably won't be limited to the > > deduplication feature; > > Indeed, we run up against this sort of thing all the time in, eg, planner > optimizations. I think some sort of "equality is precise" indicator > would be really useful for a lot of things. Suppose I wanted to add support for deduplication of a B-Tree index on an array of integers. This probably wouldn't be very compelling, but just suppose. It's not clear how this could work within the confines of the type and operator class systems. I can hardly determine that it's safe or unsafe to do so at CREATE INDEX time, since the opclass-declared input data type is always the pg_type.oid corresponding to 'anyarray' -- I am forced to make a generic assumption that deduplication is not safe. I must make this conservative assumption since, in general, the indexed column could turn out to be an array of numeric datums -- a "transitively unsafe" anyarray (numeric's display scale issue could leak into anyarray). I'm not actually worried about any practical downside that this may create for users of the B-Tree deduplication feature; a B-Tree index on an array *is* a pretty niche thing. Does seem like I should make sure that I get this right, though. Code like the 'anyarray' B-Tree support function 1 (i.e. btarraycmp()/array_cmp()) doesn't hint at a solution -- it merely does a lookup of the underlying type's comparator using the typcache. That depends on having actual anyarray datums to do something with, which isn't something that this new infrastructure can rely on in any way. I suppose that the only thing that would work here would be to somehow look through the pg_attribute entry for the index column, which will have the details required to distinguish between (say) an array of integers (which is safe, I think) from an array of numerics (which is unsafe). From there, the information about the element type could (say) be passed to the anyarray default opclass' support function 4, which could do its own internal lookup. That seems like it might be a solution in search of a problem, though. BTW, I currently forbid cross-type support function 4 entries for an opclass, on the grounds that that isn't sensible for deduplication. Do you think that that restriction is appropriate in general, given the likelihood that this support function will be used in several other areas? Thanks -- Peter Geoghegan
Re: Building infrastructure for B-Tree deduplication that recognizeswhen opclass equality is also equivalence
From
Peter Geoghegan
Date:
On Sat, Feb 8, 2020 at 6:50 PM Peter Geoghegan <pg@bowt.ie> wrote: > My working assumption is that I only need to care about > opclass-declared input data types (pg_opclass.opcintype), plus the > corresponding collations -- the former can be used to lookup an > appropriate pg_amproc entry (i.e. B-Tree support function 4), while > the latter are passed to the support function to get an answer about > whether or not it's okay to use deduplication. This approach seems to > be good enough as far as the deduplication project's needs are > concerned. However, I think that I probably need to take a broader > view of the problem than that. Any guidance would be much appreciated. v33 of the deduplication patch series was just posted. It included this infrastructure in a separate patch, which isn't that big on its own. See: https://www.postgresql.org/message-id/CAH2-WzmQGYDDoAETGhpGtJQRv_uFHMjvQZ6JdLV-sxGoCgLBNg%40mail.gmail.com Expert review of the opclass infrastructure still seems like a good idea. I'm sure that it does everything that the deduplication feature will ever need, but I'm a little concerned about painting myself into a corner as far as other things that use the API are concerned. In particular, I hope that I haven't failed to anticipate a requirement that the planner has for the new API. Thanks -- Peter Geoghegan