Thread: Index tuple deduplication limitations in pg13

Index tuple deduplication limitations in pg13

From: Matthias van de Meent
Hi,

I was reading through the new features of PG13 (beta), and noticed
that deduplication is disabled for float4/float8 and numeric (and
jsonb, ...) because datums of those types can be equal according to
the opclass while not being binary equal.
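For reference, the effect is easy to see with a quick test (a sketch;
it assumes the pageinspect extension, whose bt_metap() gained an
allequalimage column in PG13 reporting whether the index can use
deduplication):

CREATE EXTENSION IF NOT EXISTS pageinspect;

-- heavily duplicated data: 10 distinct values, 100000 rows
CREATE TABLE dedup_demo AS
SELECT (n % 10)::integer AS i, (n % 10)::numeric AS num
FROM generate_series(1, 100000) AS s(n);

CREATE INDEX dedup_demo_i ON dedup_demo (i);
CREATE INDEX dedup_demo_num ON dedup_demo (num);

-- expected: true for the integer index, false for the numeric one
SELECT allequalimage FROM bt_metap('dedup_demo_i');
SELECT allequalimage FROM bt_metap('dedup_demo_num');

-- the numeric index should come out noticeably larger, as it cannot
-- deduplicate its (opclass-equal) entries
SELECT pg_relation_size('dedup_demo_i')   AS int_index_bytes,
       pg_relation_size('dedup_demo_num') AS numeric_index_bytes;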

But, if the ordering of operator-class equal tuples is already
system-defined, could the physical ordering of index tuples in a btree
(with deduplication enabled for "unsafe" opclasses) be updated from
[index_columns, tid] to [index_columns,
image_compare(non_datum_equal_columns), tid], giving a stable sorting
of opclass-equal and image-equal values and enabling safe consistent
deduplication?

This need not be a default-on feature (opt-in would be fine, as I
guess it could have a performance penalty), but having the ability to
use btree index tuple deduplication for numeric, float and other data
types would count as a win in my book.

-Matthias



Re: Index tuple deduplication limitations in pg13

From: Peter Geoghegan
On Mon, Aug 17, 2020 at 11:44 PM Matthias van de Meent
<matthias.vandemeent@cofano.nl> wrote:
> But, if the ordering of operator-class equal tuples is already
> system-defined, could the physical ordering of index tuples in a btree
> (with deduplication enabled for "unsafe" opclasses) be updated from
> [index_columns, tid] to [index_columns,
> image_compare(non_datum_equal_columns), tid], giving a stable sorting
> of opclass-equal and image-equal values and enabling safe consistent
> deduplication?

The issue isn't the physical ordering. The issue is that we cannot
allow the implementation to destroy semantic differences among equal
datums. We avoid deduplication with cases where two equal datums are
visibly different. For example, it would not be okay if we forgot that
your numeric datum was originally input as '5.000', and output '5'
later on.
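
To see the semantic difference being preserved here:

-- the two inputs are equal under the numeric opclass, but they do
-- not display the same way, and that difference must not be lost
SELECT '5.000'::numeric = '5'::numeric AS opclass_equal,
       '5.000'::numeric::text          AS with_scale,
       '5'::numeric::text              AS without_scale;
-- opclass_equal = true, with_scale = '5.000', without_scale = '5'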

If we wanted to fix this for numeric, we'd have to invent a new
numeric datatype (called numeric2, say). That probably isn't as hard
as it sounds, since it could be part of the same B-Tree operator
family as numeric. It could also be implicitly cast to numeric.

-- 
Peter Geoghegan



Re: Index tuple deduplication limitations in pg13

From: Peter Geoghegan
On Tue, Aug 18, 2020 at 9:44 AM Peter Geoghegan <pg@bowt.ie> wrote:
> If we wanted to fix this for numeric, we'd have to invent a new
> numeric datatype (called numeric2, say). That probably isn't as hard
> as it sounds, since it could be part of the same B-Tree operator
> family as numeric. It could also be implicitly cast to numeric.

I forgot to say: numeric2 would be just like numeric, except in one
specific way: it wouldn't care about display scale. The user would be
giving up on display scale by choosing numeric2 over numeric. The "5
vs 5.000" information would always be lost by design, so there'd be
nothing for deduplication to break. Deduplication could then be
enabled.
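
As a rough illustration of those semantics, PG13's trim_scale()
already discards the display scale of an individual value. (It is not
a substitute for the new type, though: an expression index over it
would still use the regular numeric opclass and, as far as I can
tell, still couldn't be deduplicated.)

-- trim_scale() removes trailing zeroes, i.e. the display scale
SELECT trim_scale('5.000'::numeric)               AS trimmed,    -- 5
       trim_scale('5.000'::numeric)::text = '5'   AS scale_gone; -- true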

-- 
Peter Geoghegan



Re: Index tuple deduplication limitations in pg13

From: Matthias van de Meent
On Tue, 18 Aug 2020 at 18:44, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Aug 17, 2020 at 11:44 PM Matthias van de Meent
> <matthias.vandemeent@cofano.nl> wrote:
> > But, if the ordering of operator-class equal tuples is already
> > system-defined, could the physical ordering of index tuples in a btree
> > (with deduplication enabled for "unsafe" opclasses) be updated from
> > [index_columns, tid] to [index_columns,
> > image_compare(non_datum_equal_columns), tid], giving a stable sorting
> > of opclass-equal and image-equal values and enabling safe consistent
> > deduplication?
>
> The issue isn't the physical ordering. The issue is that we cannot
> allow the implementation to destroy semantic differences among equal
> datums.

Deduplication does not need to destroy semantic differences? 'equal'
can (in my book) mean:
- 'opclass-equal', that is the opclass returns true for an equality check
- 'binary equal' or 'datum-equal' (? maybe incorrect term), that is
the unprocessed on-disk representations (datum image is the right term
I believe?) of the compared values are indistinguishable.

Runs of 'binary equal' datums can be freely deduplicated [0] when found.
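
As I understand it, the dedup-safety decision is made per opclass
through the new btree support function 4 (the "equalimage" routine
added in PG13). A catalog query along these lines should show which
btree opfamilies register one at all; numeric, float and jsonb appear
to be absent, which matches the disabled deduplication:

-- sketch: btree opfamilies that register an equalimage routine
-- (support function number 4 in PG13)
SELECT opf.opfname,
       amp.amproclefttype::regtype AS input_type,
       amp.amproc                  AS equalimage_proc
FROM pg_amproc amp
JOIN pg_opfamily opf ON opf.oid = amp.amprocfamily
JOIN pg_am am ON am.oid = opf.opfmethod
WHERE am.amname = 'btree' AND amp.amprocnum = 4
ORDER BY opf.opfname;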

> We avoid deduplication with cases where two equal datums are
> visibly different. For example, it would not be okay if we forgot that
> your numeric datum was originally input as '5.000', and output '5'
> later on.

I agree on that point. But isn't this display scale also stored in
the 'datum image' [1], so that two opclass-equal values could still
be distinguished by it while deduplicating, putting only binary-equal
values into a posting list? E.g. all values generated as
'0.000'::numeric can be compressed into one posting tuple without
information loss, as the datum image is effectively the same for all
of them (unless I have missed something) and distinct from the datum
image of '0'::numeric. With enough equal values, you would eventually
have a posting list for each distinct datum image of equal values.

Given that the above could work, the current btree tuple ordering is
not optimized for opclass-equal but datum image-distinct values: the
ordering of opclass-equal values is currently determined only by tid,
so the current on-disk ordering could for example be ['0.0', '0',
'0.00', '0', '0.0', '0']. It would be better optimized for
deduplication if that were stored as e.g. ['0', '0', '0', '0.0',
'0.0', '0.00'], which is why I suggested adding an ordering by the
datum image before the tid ordering. Additionally, this extra
ordering also prevents the problem of [0], by never attempting to
insert a datum with a different image into a posting list of
otherwise-equal values: it would be ordered either before or after
the posting list, never inside it.

> If we wanted to fix this for numeric, we'd have to invent a new
> numeric datatype (called numeric2, say). That probably isn't as hard
> as it sounds, since it could be part of the same B-Tree operator
> family as numeric. It could also be implicitly cast to numeric.
>
> --
> Peter Geoghegan

[0]
Inserting a row with TID ntid into a deduplicated index can encounter
a posting list of opclass-equal but not datum image-equal tuples,
where the lowest TID of the posting list is less than ntid and ntid
is less than the highest TID of the posting list. This would require
a posting list split to accommodate the new tuple's index entry, in
order not to lose data.

[1]
-- a table with 1000 distinct values, each duplicated 700 times, but
-- split into 7x100 'datum-equal' value groups that should be
-- deduplicatable
SELECT (n % 1000 || '.' || repeat('0', n % 7))::numeric AS num
INTO numbers
FROM generate_series(1, 700000) series(n);
CREATE INDEX numeric_index ON numbers (num);
ANALYZE numbers;
-- this is an index-only scan, and the results show numeric scale, so
-- that information must be stored in the index
SELECT num FROM numbers ORDER BY num LIMIT 700;
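
To double-check that the plan really is an index-only scan (assuming
the visibility map has been set, e.g. by a vacuum):

VACUUM numbers;
EXPLAIN (COSTS OFF) SELECT num FROM numbers ORDER BY num LIMIT 700;
-- expected plan shape: Limit -> Index Only Scan using numeric_index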



Re: Index tuple deduplication limitations in pg13

From: Peter Geoghegan
On Tue, Aug 18, 2020 at 11:52 AM Matthias van de Meent
<matthias.vandemeent@cofano.nl> wrote:
> Deduplication does not need to destroy semantic differences? 'equal'
> can (in my book) mean:
> - 'opclass-equal', that is the opclass returns true for an equality check
> - 'binary equal' or 'datum-equal' (? maybe incorrect term), that is
> the unprocessed on-disk representations (datum image is the right term
> I believe?) of the compared values are indistinguishable.
>
> Runs of 'binary equal' datums can be freely deduplicated [0] when found.

> [0]
> Inserting a row with TID ntid into a deduplicated index can encounter
> a posting list of opclass-equal but not datum image-equal tuples,
> where the lowest TID of the posting list is less than ntid and ntid
> is less than the highest TID of the posting list. This would require
> a posting list split to accommodate the new tuple's index entry, in
> order not to lose data.

But you can't do that easily, because it breaks subtle assumptions
about posting list splits and space utilization. In particular, it
means that you can no longer think of a posting list split as
rewriting an incoming new item such that you can more or less pretend
that there was no overlap in the first place -- code like _bt_split
and nbtsplitloc.c relies on this. Introducing new special cases to
nbtsplitloc.c is very unappealing.

More concretely, if you introduce a posting list split like this then
you need three copies of the key -- the original, the new, and a
second copy of the original. That's much more complicated.

--
Peter Geoghegan



Re: Index tuple deduplication limitations in pg13

From: Peter Geoghegan
On Tue, Aug 18, 2020 at 11:52 AM Matthias van de Meent
<matthias.vandemeent@cofano.nl> wrote:
> Given that the above could work, the current btree tuple ordering is
> not optimized for opclass-equal but datum image-distinct values: the
> ordering of opclass-equal values is currently determined only by tid,
> so the current on-disk ordering could for example be ['0.0', '0',
> '0.00', '0', '0.0', '0']. It would be better optimized for
> deduplication if that were stored as e.g. ['0', '0', '0', '0.0',
> '0.0', '0.00'], which is why I suggested adding an ordering by the
> datum image before the tid ordering. Additionally, this extra
> ordering also prevents the problem of [0], by never attempting to
> insert a datum with a different image into a posting list of
> otherwise-equal values: it would be ordered either before or after
> the posting list, never inside it.

Yeah, that would work, but at the cost of making numeric totally
unusable. Now you cannot rely on unique enforcement detecting that '0'
is a duplicate of '0.0'. In fact, even the most trivial use of the =
operator will be broken in the presence of different display scales.
It's a non-starter.
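
That is, this existing behaviour has to keep working exactly as it
does today:

CREATE TABLE uniq_demo (num numeric UNIQUE);
INSERT INTO uniq_demo VALUES ('0');
-- must still fail with a unique_violation, because '0' and '0.0'
-- are equal according to the numeric opclass
INSERT INTO uniq_demo VALUES ('0.0');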

The numeric2 design that I sketched is a bit ugly, but I can see no
better way. A three-way posting list split (i.e. the other design that
you sketched) is a special case that is very hard to test, very
complicated, and of little value in the grand scheme of things.

-- 
Peter Geoghegan



Re: Index tuple deduplication limitations in pg13

From: Matthias van de Meent
On Tue, 18 Aug 2020 at 22:00, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Tue, Aug 18, 2020 at 11:52 AM Matthias van de Meent
> <matthias.vandemeent@cofano.nl> wrote:
> > Given that the above could work, the current btree tuple ordering is
> > not optimized for opclass-equal but datum image-distinct values: the
> > ordering of opclass-equal values is currently determined only by tid,
> > so the current on-disk ordering could for example be ['0.0', '0',
> > '0.00', '0', '0.0', '0']. It would be better optimized for
> > deduplication if that were stored as e.g. ['0', '0', '0', '0.0',
> > '0.0', '0.00'], which is why I suggested adding an ordering by the
> > datum image before the tid ordering. Additionally, this extra
> > ordering also prevents the problem of [0], by never attempting to
> > insert a datum with a different image into a posting list of
> > otherwise-equal values: it would be ordered either before or after
> > the posting list, never inside it.
>
> Yeah, that would work, but at the cost of making numeric totally
> unusable. Now you cannot rely on unique enforcement detecting that '0'
> is a duplicate of '0.0'. In fact, even the most trivial use of the =
> operator will be broken in the presence of different display scales.
> It's a non-starter.

Would this extra ordering not effectively be an extra tiebreaker in
the ordering, applied before the TID? I do not know the full
implications of that, but I believe that would not result in the
limitations that you are mentioning.

- Matthias



Re: Index tuple deduplication limitations in pg13

From: Peter Geoghegan
On Tue, Aug 18, 2020 at 1:31 PM Matthias van de Meent
<matthias.vandemeent@cofano.nl> wrote:
> Would this extra ordering not effectively be an extra tiebreaker in
> the ordering, applied before the TID? I do not know the full
> implications of that, but I believe that would not result in the
> limitations that you are mentioning.

You could probably do it that way, but again you end up with a lot of
new complexity. Not to mention overhead that would have to be paid by
everyone. It would require code that supports the old on-disk
ordering for pg_upgraded indexes (even if the change went into
Postgres 13), and that code would also be hard to test. And it might
defeat certain future optimizations based on heap TID being the only
tiebreaker. Having two types of equality might have to bleed into the
optimizer.

It's a question of engineering trade-offs. I don't think that it's worth it.

-- 
Peter Geoghegan