Thread: Building infrastructure for B-Tree deduplication that recognizes whenopclass equality is also equivalence

Anastasia's nbtree deduplication patch [1] has an open problem that I
would like to find a solution for: it currently assumes that there is
no difference between binary equality and opclass equality. That won't
work for opclasses such as btree/numeric, because compressing equal
numeric datums could destroy display scale if equal numeric datums
were naively lumped together (actually, the deduplication patch
doesn't work like that, but it has other subtle problems due to not
having worked out fundamental definitional issues).

We don't need to be able to assume that binary equality is exactly the
same thing as opclass equality at the level of individual tuples. We
only need to be able to assume that the user cannot observe any
differences when they are shown output for two datums that are
opclass-equal for any opclass that supports deduplication (i.e. cases
like the numeric_ops case just won't work, so we shouldn't event try).
I believe that it would be okay if we treated two IndexTuples as
equivalent and therefore targets to store together in the same posting
list when they happen to have distinct binary representations due to
the original datums having different TOAST input state. In short, the
deduplication patch cannot tolerate being unable to store
opclass-equal IndexTuples in the same posting list when the opclass
(or the underlying type being indexed) somehow allows that equality
isn't equivalence -- that's simply unsupportable. The opclass gets one
chance to say whether or not it vetoes the use of deduplication: at
CREATE INDEX time.

Consumers of this new infrastructure probably won't be limited to the
deduplication feature; the same infrastructure will be needed for a
B-Tree prefix compression patch (I'm thinking of a configurable CREATE
INDEX prefix compression feature). GIN never had to solve this problem
because its indexes are always lossy, and cannot support index-only
scans. It seems likely that a scheme like the one I have in mind can
work for the vast majority of Postgres B-Tree indexes in practice, so
I don't think that the user-visible restrictions I'm considering will
make the patch significantly less useful (it's already very useful).
The most notable restriction for users will almost certainly be not
supporting deduplication within indexes that use nondeterministic
collations. They were already paying a performance penalty during
hashing, though.

I would like to:

* Get some buy-in on whether or not the precise distinctions I would
like to make are correct for deduplication in particular, and as
useful as possible for other cases that we may need to add later on.

* Figure out the exact interface through which opclass/opfamily
authors can represent that their notion of equality is compatible with
deduplication/compression. (I think that the use of nondeterministic
collations should disable deduplication without explicit action from
the operator class -- that should just be baked in.)

* Mark most existing btree operator classes as being compatible with
deduplication as part of making the patch committable. As I said, I
believe that their semantics are already compatible with what we need
for deduplication to work sensibly, aside from a handful of specific
exceptions.

In any case, I'm certain that problems like the btree/numeric display
scale problem are simply not worth solving directly. That would add a
huge amount of complexity for very little benefit.

[1] https://commitfest.postgresql.org/24/2202/
-- 
Peter Geoghegan



Peter Geoghegan <pg@bowt.ie> writes:
> We don't need to be able to assume that binary equality is exactly the
> same thing as opclass equality at the level of individual tuples. We
> only need to be able to assume that the user cannot observe any
> differences when they are shown output for two datums that are
> opclass-equal for any opclass that supports deduplication (i.e. cases
> like the numeric_ops case just won't work, so we shouldn't event try).

Hmm, so that would exclude the optimization for numeric, float4/float8,
and nondeterministic text collations.  Anything else?

I agree that teaching opclasses to say whether this is okay is a
reasonable approach.

> Consumers of this new infrastructure probably won't be limited to the
> deduplication feature;

Indeed, we run up against this sort of thing all the time in, eg, planner
optimizations.  I think some sort of "equality is precise" indicator
would be really useful for a lot of things.

            regards, tom lane



On Sun, Aug 25, 2019 at 1:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hmm, so that would exclude the optimization for numeric, float4/float8,
> and nondeterministic text collations.  Anything else?

Any pseudo-type whose output function could possibly be dependent on
the output function of another type (in case it happens to be one of
the types that definitely aren't safe). Maybe we could make fine
distinctions about pseudo-type safety in certain contexts, but that
doesn't matter to the deduplication patch.

> I agree that teaching opclasses to say whether this is okay is a
> reasonable approach.

Great.

> > Consumers of this new infrastructure probably won't be limited to the
> > deduplication feature;
>
> Indeed, we run up against this sort of thing all the time in, eg, planner
> optimizations.  I think some sort of "equality is precise" indicator
> would be really useful for a lot of things.

The case that I happened to think of was "collation strength
reduction". In other words, an optimization that has the planner use a
merge equijoin whose joinqual involves two text columns using the "C"
collation, even though the "C" collation isn't otherwise usable.
Perhaps there are far more compelling planner optimization that I
haven't considered, though. This idea probably has problems with
interesting sort orders that aren't actually that interesting.

-- 
Peter Geoghegan



On Sun, Aug 25, 2019 at 2:18 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Indeed, we run up against this sort of thing all the time in, eg, planner
> > optimizations.  I think some sort of "equality is precise" indicator
> > would be really useful for a lot of things.
>
> The case that I happened to think of was "collation strength
> reduction".

I was thinking of stashing an "equality is precise" flag in the
metapage of each nbtree index, since we will only determine this once,
at CREATE INDEX time. That would make it fairly natural for the
planner to ask about the "equality is precise"-ness of the index at
the same point that it calls _bt_getrootheight(): within
get_relation_info().

-- 
Peter Geoghegan



Peter Geoghegan <pg@bowt.ie> writes:
> I was thinking of stashing an "equality is precise" flag in the
> metapage of each nbtree index, since we will only determine this once,
> at CREATE INDEX time.

Sure.

> That would make it fairly natural for the
> planner to ask about the "equality is precise"-ness of the index at
> the same point that it calls _bt_getrootheight(): within
> get_relation_info().

The planner will almost certainly want to ask the opclass directly,
because most of the places where it wants to know this sort of thing
about operator behavior have nothing to do with indexes.

            regards, tom lane



On Sun, Aug 25, 2019 at 2:40 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I was thinking of stashing an "equality is precise" flag in the
> > metapage of each nbtree index, since we will only determine this once,
> > at CREATE INDEX time.
>
> Sure.

I suppose that we'd add something new to CREATE OPERATOR CLASS to make
this work? My instinct is to avoid adding things that are only
meaningful for a single AM to interfaces like CREATE OPERATOR CLASS,
but the system already has numerous dependencies on B-Tree opclasses
that seem comparable to me.

There is a single case where nbtree stores a type that differs from
the type actually being indexed by the operator class: the "name"
case, where the underlying storage type is actually cstring. I'm not
sure whether or not this needs to be treated as its own kind of
special case. I suppose that we can ignore it completely, because
we're not directly concerned with the physical representation used
within an index. In fact, a major goal for this new infrastructure is
that nbtree gets to fully own the representation (it just needs to
know about the high level or logical requirements).

-- 
Peter Geoghegan



On Sun, Aug 25, 2019 at 2:55 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I suppose that we'd add something new to CREATE OPERATOR CLASS to make
> this work? My instinct is to avoid adding things that are only
> meaningful for a single AM to interfaces like CREATE OPERATOR CLASS,
> but the system already has numerous dependencies on B-Tree opclasses
> that seem comparable to me.

Another question is whether or not it would be okay to define
"equality is precise"-ness to be "the system's generic equality
function works perfectly as a drop-in replacement for my own equality
operator's function". The system's generic equality function could be
the recently added datum_image_eq() function -- that looks like it
will do exactly what I have in mind. This would be a new way of using
datum_image_eq(), I think, since it wouldn't be okay for it to give an
answer that differed from the equality operator's function. It looks
like existing datum_image_eq() callers can deal with false negatives
(but not false positives, which are impossible).

This exceeds what is strictly necessary for the deduplication patch,
but it seems like the patch should make comparisons as fast as
possible in the context of deduplicating items (it would be nice if it
could just use datum_image_eq instead of an insertion scankey when
doing many comparisons to deduplicate items). We can imagine a
datatype with undefined garbage bytes that affect the answer that
datum_image_eq() gives, but could be safe targets for deduplication,
so it's not clear if being this aggressive will work. But maybe that
isn't actually possible among types that aren't inherently unsafe for
deduplication. And maybe we could be more aggressive with
optimizations in numerous other contexts by defining "equality is
precise"-ness as strict binary equality after accounting for TOAST
compression.

-- 
Peter Geoghegan



Peter Geoghegan <pg@bowt.ie> wrote:

> Consumers of this new infrastructure probably won't be limited to the
> deduplication feature;

It'd also solve an open problem of the aggregate push-down patch [1], in
particular see the mention of pg_opclass in [2]: the partial aggregate
node below the final join must not put multiple opclass-equal values of
which are not byte-wise equal into the same group because some
information needed by WHERE or JOIN/ON condition may be lost this
way. The scale of the numeric type is the most obvious example.

> I would like to:
> 
> * Get some buy-in on whether or not the precise distinctions I would
> like to make are correct for deduplication in particular, and as
> useful as possible for other cases that we may need to add later on.
> 
> * Figure out the exact interface through which opclass/opfamily
> authors can represent that their notion of equality is compatible with
> deduplication/compression.

It's not entirely clear to me whether opclass or opfamily should carry
this information. opclass probably makes more sense for index related
problems and the aggregate push-down patch can live with that. I don't
see particular reason to add any flag to opfamily. (Planner uses uses
both pg_opclass and pg_opfamily catalogs.)

I think the fact that the aggregate push-down would benefit from this
enhancement should affect choice of the new catalog attribute name,
i.e. it should be not mention words as concrete as "deduplication" or
"compression".

> (I think that the use of nondeterministic collations should disable
> deduplication without explicit action from the operator class -- that
> should just be baked in.)

(I think the aggregate push-down needs to consider the nondeterministic
collations too, I missed that so far.)

[1] https://commitfest.postgresql.org/24/1247/

[2] https://www.postgresql.org/message-id/10529.1547561178%40localhost

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com



26.08.2019 14:15, Antonin Houska wrote:
> Peter Geoghegan <pg@bowt.ie> wrote:
>
>> Consumers of this new infrastructure probably won't be limited to the
>> deduplication feature;
> It'd also solve an open problem of the aggregate push-down patch [1], in
> particular see the mention of pg_opclass in [2]: the partial aggregate
> node below the final join must not put multiple opclass-equal values of
> which are not byte-wise equal into the same group because some
> information needed by WHERE or JOIN/ON condition may be lost this
> way. The scale of the numeric type is the most obvious example.
>
>> I would like to:
>>
>> * Get some buy-in on whether or not the precise distinctions I would
>> like to make are correct for deduplication in particular, and as
>> useful as possible for other cases that we may need to add later on.
>>
>> * Figure out the exact interface through which opclass/opfamily
>> authors can represent that their notion of equality is compatible with
>> deduplication/compression.
> It's not entirely clear to me whether opclass or opfamily should carry
> this information. opclass probably makes more sense for index related
> problems and the aggregate push-down patch can live with that. I don't
> see particular reason to add any flag to opfamily. (Planner uses uses
> both pg_opclass and pg_opfamily catalogs.)
>
> I think the fact that the aggregate push-down would benefit from this
> enhancement should affect choice of the new catalog attribute name,
> i.e. it should be not mention words as concrete as "deduplication" or
> "compression".


The patch implementing new opclass option is attached.

It adds new attribute pg_opclass.opcisbitwise, which is set to true if 
opclass equality is the same as binary equality.
By default it is true. It is set to false for numeric and float4, float8.

Does anyarray opclasses need special treatment?

New syntax for create opclass is  "CREATE OPERATOR CLASS NOT BITWISE ..."

Any ideas on better names?

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment
Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:

> The patch implementing new opclass option is attached.
>
> It adds new attribute pg_opclass.opcisbitwise, which is set to true if opclass
> equality is the same as binary equality.
> By default it is true.

I think the default value should be false and we should only set it to true
for individual opclasses which do meet the bitwise equality requirement. Also
extension authors should explicitly state that their data types are bitwise
equal. Otherwise the existing opclasses, when created via pg_dump ->
pg_restore, can be used by the system incorrectly.

> It is set to false for numeric and float4, float8.

Are you sure about these?

--
Antonin Houska
Web: https://www.cybertec-postgresql.com



>>>>> "Antonin" == Antonin Houska <ah@cybertec.at> writes:

 >> It is set to false for numeric and float4, float8.

 Antonin> Are you sure about these?

numeric values can compare equal but have different display scales (see
hash_numeric).

float4 and float8 both have representations for -0, which compares equal
to 0. (numeric technically has a representation for -0 too, but I
believe the current code carefully avoids ever generating it.)

-- 
Andrew (irc:RhodiumToad)



01.10.2019 8:41, Antonin Houska wrote:
> Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
>
>> The patch implementing new opclass option is attached.
>>
>> It adds new attribute pg_opclass.opcisbitwise, which is set to true if opclass
>> equality is the same as binary equality.
>> By default it is true.
> I think the default value should be false and we should only set it to true
> for individual opclasses which do meet the bitwise equality requirement. Also
> extension authors should explicitly state that their data types are bitwise
> equal. Otherwise the existing opclasses, when created via pg_dump ->
> pg_restore, can be used by the system incorrectly.

Thank you for the feedback.

At first I implemented bitwise as default, because it is more common .
Though, I agree that it's essential to avoid false positives here.
The new version of the patch is attached. I also updated pg_dump.

A few more open questions:
1) How to handle contrib modules that create new opclasses?
Since default is 'not bitwise' it means that various operator classes 
created in extensions
such as bloom, btree_gin and others, won't be able to take advantage of 
various optimizations
that require the opclass to be BITWISE.

'v2-Opclass-bitwise-equality-0002' patch simply adds BITWISE keyword 
where necessary.

2) Whether we should provide ALTER OPERATOR CLASS SET BITWISE syntax?

3) Current patch modifies regression test so that it checks CREATE 
OPCLASS BITWISE syntax.
Is there anything else worth testing? As I see it, this patch is just 
about infrastructure changes,
and more specific tests will be added by features that will implement 
further optimizations.

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment
On Mon, Oct 28, 2019 at 11:11 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> At first I implemented bitwise as default, because it is more common .
> Though, I agree that it's essential to avoid false positives here.
> The new version of the patch is attached. I also updated pg_dump.
>
> A few more open questions:
> 1) How to handle contrib modules that create new opclasses?
> Since default is 'not bitwise' it means that various operator classes
> created in extensions
> such as bloom, btree_gin and others, won't be able to take advantage of
> various optimizations
> that require the opclass to be BITWISE.

What optimizations? Do we anticipate that other index AMs will benefit
from BITWISE-ness?

> 'v2-Opclass-bitwise-equality-0002' patch simply adds BITWISE keyword
> where necessary.
>
> 2) Whether we should provide ALTER OPERATOR CLASS SET BITWISE syntax?

I think that that's probably not desirable. There should at least be a
strong practical advantage if we go that way. This would mean ALTER
OPERATOR CLASS could change the "substance" of an opclass, which is
fundamentally different from what it can do already (it currently just
changes the owner, or the schema that it is stored in).

> 3) Current patch modifies regression test so that it checks CREATE
> OPCLASS BITWISE syntax.
> Is there anything else worth testing? As I see it, this patch is just
> about infrastructure changes,
> and more specific tests will be added by features that will implement
> further optimizations.

I think so too -- this is really about associating a single piece of
information with an operator class.

BTW: No need to bump catversion when posting a patch, which I see in
"v2-Opclass-*0001.patch". That is our policy. (A catversion bump is
generally supposed to be done at the last minute, just as the patch is
committed. This avoids unnecessary conflicts against the master branch
over time, as a patch is developed.)

--
Peter Geoghegan



14.11.2019 0:25, Peter Geoghegan wrote:
> On Mon, Oct 28, 2019 at 11:11 AM Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru>  wrote:
>> At first I implemented bitwise as default, because it is more common .
>> Though, I agree that it's essential to avoid false positives here.
>> The new version of the patch is attached. I also updated pg_dump.
>>
>> A few more open questions:
>> 1) How to handle contrib modules that create new opclasses?
>> Since default is 'not bitwise' it means that various operator classes
>> created in extensions
>> such as bloom, btree_gin and others, won't be able to take advantage of
>> various optimizations
>> that require the opclass to be BITWISE.
> What optimizations? Do we anticipate that other index AMs will benefit
> from BITWISE-ness?
I was thinking of possible planner optimizations, that Tom mentioned up 
thread.
Though, I don't have any specific examples. Anyway, we can implement 
support for user-defined opclasses later.
>> 3) Current patch modifies regression test so that it checks CREATE
>> OPCLASS BITWISE syntax.
>> Is there anything else worth testing? As I see it, this patch is just
>> about infrastructure changes,
>> and more specific tests will be added by features that will implement
>> further optimizations.
> I think so too -- this is really about associating a single piece of
> information with an operator class.
Great. It seems that the patch is ready for commit.
I attached new version with pg_opclass documentation update.

One more thing I am uncertain about  is array_ops. Arrays may contain 
bitwise and not bitwise element types.
What is the correct value of opcisbitwise the array_ops itself?

-- 
Anastasia Lubennikova
Postgres Professional:http://www.postgrespro.com
The Russian Postgres Company


Attachment
Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:

> I attached new version with pg_opclass documentation update.
>
> One more thing I am uncertain about  is array_ops. Arrays may contain bitwise
> and not bitwise element types.
> What is the correct value of opcisbitwise the array_ops itself?

How about setting opcisbitwise to false for the array_ops opclass and checking
opcisbitwise of the element type whenever we need to know whether the array is
"bitwise equal"? When checking array_eq(), I thought whether the existence of
"expanded array" format is a problem but it does not seem to be: the
conversion of "expanded" value to "flat" value and then back to the "expanded"
should not change the array contents.

Anyway, in the current version of the patch I see that array_ops opclasses
have opcisbitwise=true. It should be false even if you don't use the approach
of checking the element type.

Besides that, I think that record_ops is similar to array_ops and therefore it
should not set opcisbitwise to true.

I also remember that, when thinking about the problem in the context of the
aggregate push down patch, I considered some of the geometric types
problematic. For example, box_eq() uses this expression

#define FPeq(A,B)                (fabs((A) - (B)) <= EPSILON)

so equality does not imply bitwise equality here. Maybe you should only set
the flag for btree opclasses for now.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com



On Wed, Nov 13, 2019 at 4:25 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I think that that's probably not desirable. There should at least be a
> strong practical advantage if we go that way. This would mean ALTER
> OPERATOR CLASS could change the "substance" of an opclass, which is
> fundamentally different from what it can do already (it currently just
> changes the owner, or the schema that it is stored in).

My impression is that this is more of an implementation restriction
than a design goal. I don't really remember the details, but it seems
to me that there were locking and/or cache invalidation problems with
making ALTER OPERATOR CLASS do more substantive things -- and that it
was because of those problems, not a lack of desire, that we didn't
support it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Thu, Dec 19, 2019 at 12:05 PM Robert Haas <robertmhaas@gmail.com> wrote:
> My impression is that this is more of an implementation restriction
> than a design goal. I don't really remember the details, but it seems
> to me that there were locking and/or cache invalidation problems with
> making ALTER OPERATOR CLASS do more substantive things -- and that it
> was because of those problems, not a lack of desire, that we didn't
> support it.

I agree with you. My point was only that this is something that the
operator class author is really expected to get right the first time
around -- just like the behavior of B-Tree support function 1. We're
really only concerned about the upgrade path for external types that
could see a benefit from the optimization planned for nbtree (and
possibly other such optimization). Providing a non-disruptive way to
get that benefit after a pg_upgrade only seems like a nice-to-have to
me, because it's not as if anything will stop working as well as it
once did. Also, there aren't that many external types that will be
made more useful by being able to use optimizations like
deduplication; in practice almost all B-Tree indexes only use a small
handful of operator classes that are shipped in core Postgres. Once
you're using common types like text and integer, a pg_upgrade'd
database is only a REINDEX away from being able to use deduplication
(though I am not even sure if even that will be necessary in the final
patch; I hope to be able to avoid even that inconvenience with indexes
using core operator classes).

If the underlying behavior of an operator class actually changes, then
that's a disaster for all the usual reasons. It doesn't make that much
sense to reverse an earlier decision to make an operator class
BITWISE. Better to drop everything, and recreate everything, since
your indexes should be considered corrupt anyway. (Also, I don't think
that it's that hard to get it right, so this will probably never
happen.)

-- 
Peter Geoghegan



19.12.2019 18:19, Antonin Houska wrote:
> Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
>
>> I attached new version with pg_opclass documentation update.
>>
>> One more thing I am uncertain about  is array_ops. Arrays may contain bitwise
>> and not bitwise element types.
>> What is the correct value of opcisbitwise the array_ops itself?
> How about setting opcisbitwise to false for the array_ops opclass and checking
> opcisbitwise of the element type whenever we need to know whether the array is
> "bitwise equal"? When checking array_eq(), I thought whether the existence of
> "expanded array" format is a problem but it does not seem to be: the
> conversion of "expanded" value to "flat" value and then back to the "expanded"
> should not change the array contents.
>
> Anyway, in the current version of the patch I see that array_ops opclasses
> have opcisbitwise=true. It should be false even if you don't use the approach
> of checking the element type.
>
> Besides that, I think that record_ops is similar to array_ops and therefore it
> should not set opcisbitwise to true.
>
> I also remember that, when thinking about the problem in the context of the
> aggregate push down patch, I considered some of the geometric types
> problematic. For example, box_eq() uses this expression
>
> #define FPeq(A,B)                (fabs((A) - (B)) <= EPSILON)
>
> so equality does not imply bitwise equality here. Maybe you should only set
> the flag for btree opclasses for now.

Thank you for pointing out at the issue with geometric opclasses.
If I understand it correctly, regular float types are not bitwise as well.

I updated the patchset.
The first patch now contains only infrastructure changes
and the second one sets opcisbitwise for btree opclasses in pg_opclass.dat.

I've tried to be conservative and only mark types that are 100% bitwise 
safe.
See attached v2-Opclass-isbitwise.out file.

Non-atomic types, such as record, range, json and enum depend on element 
types.
Text can be considered bitwise (i.e. texteq uses memcmp) only when 
specific collation clauses are satisfied.

We can make this 'opcisbitwise' parameter enum (or char) instead of 
boolean to mark
"always bitwise", "never bitwise" and "maybe bitwise". Though, I doubt 
if it will be helpful in any real use case.

What do you think?

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment
> @@ -106,6 +106,18 @@ CREATE OPERATOR CLASS <replaceable class="parameter">name</replaceable> [ DEFAUL
>      </listitem>
>     </varlistentry>
>  
> +    <varlistentry>
> +    <term><literal>NOT BITWISE</literal></term>
> +    <listitem>
> +     <para>
> +      If present, the operator class equality is not the same as equivalence.
> +      For example, two numerics can compare equal but have different scales.
> +      Most opclasses implement bitwise equal comparison, alternative behaviour
> +      must be set explicitly.
> +     </para>
> +    </listitem>
> +   </varlistentry>

Am I the only one bothered by the fact that this patch (and all
downstream discussion) reduces the term "bitwise equality" to simply
"bitwise"?  It reads really strange to me, both in the resulting SQL
grammar as well as in struct names, code comments etc.  "This operator
class is bitwise."

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Am I the only one bothered by the fact that this patch (and all
> downstream discussion) reduces the term "bitwise equality" to simply
> "bitwise"?  It reads really strange to me, both in the resulting SQL
> grammar as well as in struct names, code comments etc.  "This operator
> class is bitwise."

I agree, that's really poor English.

            regards, tom lane



24.12.2019 19:08, Alvaro Herrera wrote:
>> @@ -106,6 +106,18 @@ CREATE OPERATOR CLASS <replaceable class="parameter">name</replaceable> [ DEFAUL
>>       </listitem>
>>      </varlistentry>
>>   
>> +    <varlistentry>
>> +    <term><literal>NOT BITWISE</literal></term>
>> +    <listitem>
>> +     <para>
>> +      If present, the operator class equality is not the same as equivalence.
>> +      For example, two numerics can compare equal but have different scales.
>> +      Most opclasses implement bitwise equal comparison, alternative behaviour
>> +      must be set explicitly.
>> +     </para>
>> +    </listitem>
>> +   </varlistentry>
> Am I the only one bothered by the fact that this patch (and all
> downstream discussion) reduces the term "bitwise equality" to simply
> "bitwise"?  It reads really strange to me, both in the resulting SQL
> grammar as well as in struct names, code comments etc.  "This operator
> class is bitwise."
>
Thank you for pointing that out.
Do you have any suggestions on how to name it better?
Should it rather be "CREATE OPERATOR CLASS ... BITWISE EQUAL" ?

In the recent version of the patch I also had a question,
if it will be useful to do this option enum instead of boolean:

> We can make this 'opcisbitwise' parameter enum (or char) instead of 
> boolean to mark
> "always bitwise", "never bitwise" and "maybe bitwise". 

This decision will also affect the syntax. So I'd rather agree on that 
before updating syntax.
Do you have an opinion on that?

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




On Tue, Dec 24, 2019 at 7:29 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> We can make this 'opcisbitwise' parameter enum (or char) instead of
> boolean to mark
> "always bitwise", "never bitwise" and "maybe bitwise". Though, I doubt
> if it will be helpful in any real use case.

What would be the difference between "never bitwise" and "maybe
bitwise" in that scheme?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



29.12.2019 2:56, Robert Haas wrote:
> On Tue, Dec 24, 2019 at 7:29 AM Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>> We can make this 'opcisbitwise' parameter enum (or char) instead of
>> boolean to mark
>> "always bitwise", "never bitwise" and "maybe bitwise". Though, I doubt
>> if it will be helpful in any real use case.
> What would be the difference between "never bitwise" and "maybe
> bitwise" in that scheme?

In this design "maybe" category reflects the need for an extra recheck.

For example, float and numeric types are "never bitwise equal", while array,
text, and other container types are "maybe bitwise equal". An array of 
integers
or text with C collation can be treated as bitwise equal attributes, and it
would be too harsh to restrict them from deduplication.

What bothers me is that this option will unlikely be helpful on its own 
and we
should also provide some kind of recheck function along with opclass, which
complicates this idea even further and doesn't seem very clear.

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




On Mon, Dec 30, 2019 at 10:57 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> In this design "maybe" category reflects the need for an extra recheck.
>
> For example, float and numeric types are "never bitwise equal", while array,
> text, and other container types are "maybe bitwise equal". An array of
> integers
> or text with C collation can be treated as bitwise equal attributes, and it
> would be too harsh to restrict them from deduplication.
>
> What bothers me is that this option will unlikely be helpful on its own
> and we
> should also provide some kind of recheck function along with opclass, which
> complicates this idea even further and doesn't seem very clear.

It seems like the simplest thing might be to forget about the 'char'
column and just have a support function which can be used to assess
whether a given opclass's notion of equality is bitwise. If equality
is always bitwise, the function can always return true. If it's
sometimes bitwise, it can return true or false as appropriate. If it's
never bitwise, then it can either always return false or the support
function can be omitted altogether (so that the safe value is the
default).

I don't think you're going to save very much by avoiding an indirect
function call in the "always" case. It doesn't really seem worth the
complexity of making that a special case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Mon, Dec 30, 2019 at 9:45 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > For example, float and numeric types are "never bitwise equal", while array,
> > text, and other container types are "maybe bitwise equal". An array of
> > integers
> > or text with C collation can be treated as bitwise equal attributes, and it
> > would be too harsh to restrict them from deduplication.

We might as well support container types (like array) in the first
Postgres version that has nbtree deduplication, I suppose. Even still,
I don't think that it actually matters much to users. B-Tree indexes
on arrays are probably very rare. Note that I don't consider text to
be a container type here -- obviously btree/text_ops is a very
important opclass for the deduplication feature. It may be the most
important opclass overall.

Recursively invoking a support function for the "contained" data type
in the btree/array_ops support function seems like it might be messy.
Not sure about that, though.

> > What bothers me is that this option will unlikely be helpful on its own
> > and we
> > should also provide some kind of recheck function along with opclass, which
> > complicates this idea even further and doesn't seem very clear.
>
> It seems like the simplest thing might be to forget about the 'char'
> column and just have a support function which can be used to assess
> whether a given opclass's notion of equality is bitwise.

I like the idea of relying only on a support function.

This approach makes collations a problem that the opclass author has
to deal with directly, as is the case within a SortSupport support
function. Also seems like it would make life easier for third party
data types that want to make use of these optimizations (if in fact
there are any).

I also see little downside to this approach. The extra cycles
shouldn't be noticeable. As far as the B-Tree deduplication logic is
concerned, the final boolean value (is deduplication safe?) comes from
the index metapage -- we pass that down through an insertion scankey.
We only need to determine whether or not the optimization is safe at
CREATE INDEX time. (Actually, I don't want to commit to the idea that
nbtree should only call this support function at CREATE INDEX time
right now. I'm sure that it will hardly ever need to be called,
though.)

-- 
Peter Geoghegan



On Tue, Dec 24, 2019 at 4:29 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> I updated the patchset.
> The first patch now contains only infrastructure changes
> and the second one sets opcisbitwise for btree opclasses in pg_opclass.dat.

We should try to formally define what we're trying to represent about
B-Tree opclasses here -- the definition of
"opcisbitwise"/preciseness/whatever should be tightened up. In
particular, it should be clear how the "image" binary row comparators
[1] (i.e. "operator *= equality" stuff) fit in. This new concept
should be defined in terms of that existing concept -- we're talking
about exactly the same variety of "internal binary equality" here, I
think.

I propose that we adopt the following definition: For an operator
class to be safe, its equality operator has to always agree with
datum_image_eq() (i.e. two datums must be bitwise equal after
detoasting).

(Maybe we should say something about "operator *= equality" as well
(or instead), since that is already documented in [1].)

We may also want to say something about foreign keys in this formal
definition of "opcisbitwise"/preciseness. Discussion around the bug
fixed by commit 1ffa59a85cb [1] showed that there was plenty of
confusion in this area. Commit 1ffa59a85cb simply solved the problem
that existed with foreign keys, without "joining the dots". It reused
the rowtypes.c "operator *= equality" stuff to fix the problem, but
only in an ad-hoc and undoumented way. Let's not do that again now.

Note: In theory this definition is stricter than truly necessary to
make deduplication safe, because we can imagine a contrived case in
which an operator class exists where datum_image_eq() does not always
agree with the equality operator, even though the equality operator
will reliably consider two datums to be equal only when they have
identical outputs from the underlying type's output function. This
could happen when an operator class author wasn't very careful about
zeroing padding -- this may not have mattered to the opclass author
because nobody relied on that padding anyway. I think that stuff like
this is not worth worrying about -- it can only happen because the
datatype/operator class author was very sloppy.

Note also: We seem to make this assumption already. Maybe this
uninitialized bytes side issue doesn't even need to be pointed out or
discussed. The comment just above VALGRIND_CHECK_MEM_IS_DEFINED()
within PageAddItemExtended() seems to suggest this. The comment
specifically mentions datumIsEqual() (not datum_image_eq()), but it's
exactly the same issue.

[1] https://www.postgresql.org/docs/devel/functions-comparisons.html#COMPOSITE-TYPE-COMPARISON
[2] https://www.postgresql.org/message-id/flat/3326fc2e-bc02-d4c5-e3e5-e54da466e89a%402ndquadrant.com

-- 
Peter Geoghegan



On Mon, Dec 30, 2019 at 6:58 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I propose that we adopt the following definition: For an operator
> class to be safe, its equality operator has to always agree with
> datum_image_eq() (i.e. two datums must be bitwise equal after
> detoasting).

I suggested using datumIsEqual() as the canonical definition. (I
wonder why datum_image_eq() does not reuse that function?)

> Note: In theory this definition is stricter than truly necessary to
> make deduplication safe, because we can imagine a contrived case in
> which an operator class exists where datum_image_eq() does not always
> agree with the equality operator, even though the equality operator
> will reliably consider two datums to be equal only when they have
> identical outputs from the underlying type's output function. This
> could happen when an operator class author wasn't very careful about
> zeroing padding -- this may not have mattered to the opclass author
> because nobody relied on that padding anyway. I think that stuff like
> this is not worth worrying about -- it can only happen because the
> datatype/operator class author was very sloppy.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Thu, Jan 2, 2020 at 6:42 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Dec 30, 2019 at 6:58 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > I propose that we adopt the following definition: For an operator
> > class to be safe, its equality operator has to always agree with
> > datum_image_eq() (i.e. two datums must be bitwise equal after
> > detoasting).
>
> I suggested using datumIsEqual() as the canonical definition. (I
> wonder why datum_image_eq() does not reuse that function?)

The difference between datum_image_eq() and datumIsEqual() is that
only the former will consider two datums equal when they happen to
have different TOAST input states -- we need that here. datumIsEqual()
avoids doing this because sometimes it needs to work for callers
operating within an aborted transaction.  datum_image_eq() was
originally used for the "*=, *<>, *<, *<=, *>, and *>=" rowtype B-Tree
operator class needed by REFRESH MATERIALIZED VIEW CONCURRENTLY.
(Actually, that's not quite true, since datum_image_eq() is a spin-off
of the rowtype code that was added much more recently to fix a bug in
foreign keys.)

The B-Tree code and amcheck need to be tolerant of inconsistent TOAST
input states. This isn't particularly likely to happen, but it would
be hard to revoke the general assumption that that's okay now. Also,
it's not that hard to deal with it directly. For example, we're not
reliant on equal index tuples all being the same size in the
deduplication patch.

-- 
Peter Geoghegan



On Thu, Jan 2, 2020 at 12:11 PM Peter Geoghegan <pg@bowt.ie> wrote:
> The difference between datum_image_eq() and datumIsEqual() is that
> only the former will consider two datums equal when they happen to
> have different TOAST input states -- we need that here.

Ah, OK. Sorry for the noise.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On 31.12.2019 01:40, Peter Geoghegan wrote:
> On Mon, Dec 30, 2019 at 9:45 AM Robert Haas <robertmhaas@gmail.com> wrote:
>>> For example, float and numeric types are "never bitwise equal", while array,
>>> text, and other container types are "maybe bitwise equal". An array of
>>> integers
>>> or text with C collation can be treated as bitwise equal attributes, and it
>>> would be too harsh to restrict them from deduplication.
> We might as well support container types (like array) in the first
> Postgres version that has nbtree deduplication, I suppose. Even still,
> I don't think that it actually matters much to users. B-Tree indexes
> on arrays are probably very rare. Note that I don't consider text to
> be a container type here -- obviously btree/text_ops is a very
> important opclass for the deduplication feature. It may be the most
> important opclass overall.
>
> Recursively invoking a support function for the "contained" data type
> in the btree/array_ops support function seems like it might be messy.
> Not sure about that, though.
>
>>> What bothers me is that this option will unlikely be helpful on its own
>>> and we
>>> should also provide some kind of recheck function along with opclass, which
>>> complicates this idea even further and doesn't seem very clear.
>> It seems like the simplest thing might be to forget about the 'char'
>> column and just have a support function which can be used to assess
>> whether a given opclass's notion of equality is bitwise.
> I like the idea of relying only on a support function.

In attachment you can find the WIP patch that adds support function for 
btree opclasses.
Before continuing, I want to ensure that I understood the discussion 
above correctly.

Current version of the patch adds:

1) new syntax, which allow to provide support function:

CREATE OPERATOR CLASS int4_ops_test
FOR TYPE int4 USING btree AS
         OPERATOR 1 =(int4, int4),
         FUNCTION 1 btint4cmp(int4, int4),
         SUPPORT datum_image_eqisbitwise;

We probably can add more words to specify the purpose of the support 
function.
Do you have any other objections about the place of this new element in 
CreateOplcass syntax structure?

2) trivial support function that always returns true 
'datum_image_eqisbitwise'.
It is named after 'datum_image_eq', because we define this support 
function via its behavior.

If this prototype is fine, I will continue this work and add support 
functions for other opclasses, update pg_dump and documentation.

Thoughts?


Attachment
On Mon, Jan 13, 2020 at 12:49 PM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> In attachment you can find the WIP patch that adds support function for
> btree opclasses.

Cool. Thanks!

> Current version of the patch adds:
>
> 1) new syntax, which allow to provide support function:
>
> CREATE OPERATOR CLASS int4_ops_test
> FOR TYPE int4 USING btree AS
>          OPERATOR 1 =(int4, int4),
>          FUNCTION 1 btint4cmp(int4, int4),
>          SUPPORT datum_image_eqisbitwise;

Hmm. Do we really need these grammar changes? If so, why? I think that
you wanted to make this something that could work with any opclass of
any index access method, but I don't see that as a useful goal. (If it
was useful, it could be considered later, on a case by case basis.)

I imagined that this infrastructure would consist of inventing a new
variety of B-Tree opclass support function -- something like
sortsupport. You could generalize from the example of commit
c6e3ac11b60, which added SortSupport functions (also known as "B-Tree
support function 2" functions). You might also take a look at the much
more recent commit 0a459cec96d, which added in_range functions (also
known as "B-Tree support function 3"). Note that neither of those two
commits had grammar changes for CREATE OPERATOR CLASS, or anything
like that. What I have in mind is a "B-Tree support function 4",
obviously.

You should probably add a C function that's similar to
PrepareSortSupportFromIndexRel()/FinishSortSupportFunction() that will
be called from the B-Tree code. This will give a simple yes/no answer
to the question: "Is it safe to apply deduplication to this Relation"?
This C function will know to return false for an opclass that doesn't
have any support function 4 set for any single attribute. It can
provide a general overview of what we're telling the caller about the
opclass here, etc. Another patch could add a similar function that
works with a plain operator, a bit like
PrepareSortSupportFromOrderingOp() -- but that isn't necessary now.

I suppose that this approach requires something a bit like a struct
SortSupportData, with filled-out collation information, etc. The
nbtree code expects a simple yes/no answer based on all columns in the
index, so it will be necessary to serialize that information to send
it across the SQL function interface -- the pg_proc support function
will have one argument of type "internal". And, I suppose that you'll
also need some basic btvalidate() validation code.

> We probably can add more words to specify the purpose of the support
> function.

Right -- some documentation is needed in btree.sgml, alongside the
existing stuff for support functions 1, 2, and 3.

> 2) trivial support function that always returns true
> 'datum_image_eqisbitwise'.
> It is named after 'datum_image_eq', because we define this support
> function via its behavior.

I like the idea of a generic, trivial SQL-callable function that all
simple scalar types can use -- one that just returns true. Maybe we
should call this general class of function an "image_equal" function,
and refer to "image equality" in the btree.sgml docs. I don't think
that using the term "bitwise" is helpful, since it sounds very precise
but is actually slightly inaccurate (since TOASTable datums can be
"image equal", but not bitwise equal according to datumIsEqual()).

How does everyone feel about "image equality" as a name? As I said
before, it seems like a good idea to tie this new infrastructure to
existing infrastructure used for things like REFRESH MATERIALIZED VIEW
CONCURRENTLY.

> If this prototype is fine, I will continue this work and add support
> functions for other opclasses, update pg_dump and documentation.

If this work is structured as a new support function, then it isn't
really a user-visible feature -- you won't need pg_dump support, psql
support, etc. Most of the documentation will be for operator class
authors rather than regular users. We can document the specific
opclasses that have support for deduplication later, if at all.

I think it would be fine if the deduplication docs (not the docs for
this infrastructure) just pointed out specific cases that we *cannot*
support -- there are not many exceptions (numeric, text with a
nondeterministic collation, a few others like that).

--
Peter Geoghegan



On Sun, Aug 25, 2019 at 1:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I agree that teaching opclasses to say whether this is okay is a
> reasonable approach.

I've begun working on this, with help from Anastasia.

My working assumption is that I only need to care about
opclass-declared input data types (pg_opclass.opcintype), plus the
corresponding collations -- the former can be used to lookup an
appropriate pg_amproc entry (i.e. B-Tree support function 4), while
the latter are passed to the support function to get an answer about
whether or not it's okay to use deduplication. This approach seems to
be good enough as far as the deduplication project's needs are
concerned. However, I think that I probably need to take a broader
view of the problem than that. Any guidance would be much appreciated.

> > Consumers of this new infrastructure probably won't be limited to the
> > deduplication feature;
>
> Indeed, we run up against this sort of thing all the time in, eg, planner
> optimizations.  I think some sort of "equality is precise" indicator
> would be really useful for a lot of things.

Suppose I wanted to add support for deduplication of a B-Tree index on
an array of integers. This probably wouldn't be very compelling, but
just suppose. It's not clear how this could work within the confines
of the type and operator class systems.

I can hardly determine that it's safe or unsafe to do so at CREATE
INDEX time, since the opclass-declared input data type is always the
pg_type.oid corresponding to 'anyarray' -- I am forced to make a
generic assumption that deduplication is not safe. I must make this
conservative assumption since, in general, the indexed column could
turn out to be an array of numeric datums -- a "transitively unsafe"
anyarray (numeric's display scale issue could leak into anyarray). I'm
not actually worried about any practical downside that this may create
for users of the B-Tree deduplication feature; a B-Tree index on an
array *is* a pretty niche thing. Does seem like I should make sure
that I get this right, though.

Code like the 'anyarray' B-Tree support function 1 (i.e.
btarraycmp()/array_cmp()) doesn't hint at a solution -- it merely does
a lookup of the underlying type's comparator using the typcache. That
depends on having actual anyarray datums to do something with, which
isn't something that this new infrastructure can rely on in any way.

I suppose that the only thing that would work here would be to somehow
look through the pg_attribute entry for the index column, which will
have the details required to distinguish between (say) an array of
integers (which is safe, I think) from an array of numerics (which is
unsafe). From there, the information about the element type could
(say) be passed to the anyarray default opclass' support function 4,
which could do its own internal lookup. That seems like it might be a
solution in search of a problem, though.

BTW, I currently forbid cross-type support function 4 entries for an
opclass, on the grounds that that isn't sensible for deduplication. Do
you think that that restriction is appropriate in general, given the
likelihood that this support function will be used in several other
areas?

Thanks
-- 
Peter Geoghegan



On Sat, Feb 8, 2020 at 6:50 PM Peter Geoghegan <pg@bowt.ie> wrote:
> My working assumption is that I only need to care about
> opclass-declared input data types (pg_opclass.opcintype), plus the
> corresponding collations -- the former can be used to lookup an
> appropriate pg_amproc entry (i.e. B-Tree support function 4), while
> the latter are passed to the support function to get an answer about
> whether or not it's okay to use deduplication. This approach seems to
> be good enough as far as the deduplication project's needs are
> concerned. However, I think that I probably need to take a broader
> view of the problem than that. Any guidance would be much appreciated.

v33 of the deduplication patch series was just posted. It included
this infrastructure in a separate patch, which isn't that big on its
own. See:

https://www.postgresql.org/message-id/CAH2-WzmQGYDDoAETGhpGtJQRv_uFHMjvQZ6JdLV-sxGoCgLBNg%40mail.gmail.com

Expert review of the opclass infrastructure still seems like a good
idea. I'm sure that it does everything that the deduplication feature
will ever need, but I'm a little concerned about painting myself into
a corner as far as other things that use the API are concerned. In
particular, I hope that I haven't failed to anticipate a requirement
that the planner has for the new API.

Thanks
-- 
Peter Geoghegan