Re: Index tuple deduplication limitations in pg13 - Mailing list pgsql-general

From Matthias van de Meent
Subject Re: Index tuple deduplication limitations in pg13
Date
Msg-id CAAs3B9obUeXBhGGcFjoSVn8+C+g4Dd0Aa0t5p+F3BaDVVfZnvw@mail.gmail.com
Whole thread Raw
In response to Re: Index tuple deduplication limitations in pg13  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-general
On Tue, 18 Aug 2020 at 18:44, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Aug 17, 2020 at 11:44 PM Matthias van de Meent
> <matthias.vandemeent@cofano.nl> wrote:
> > But, if the ordering of operator-class equal tuples is already
> > system-defined, could the physical ordering of index tuples in a btree
> > (with deduplication enabled for "unsafe" opclasses) be updated from
> > [index_columns, tid] to [index_columns,
> > image_compare(non_datum_equal_columns), tid], giving a stable sorting
> > of opclass-equal and image-equal values and enabling safe consistent
> > deduplication?
>
> The issue isn't the physical ordering. The issue is that we cannot
> allow the implementation to destroy semantic differences among equal
> datums.

Deduplication does not need to destroy semantic differences? 'equal'
can (in my book) mean:
- 'opclass-equal', that is the opclass returns true for an equality check
- 'binary equal' or 'datum-equal' (? maybe incorrect term), that is
the unprocessed on-disk representations (datum image is the right term
I believe?) of the compared values are indistinguishable.

Runs of 'binary equal' datums can be freely deduplicated [0] when found.

> We avoid deduplication with cases where two equal datums are
> visibly different. For example, it would not be okay if we forgot that
> your numeric datum was originally input as '5.000', and output '5'
> later on.

I agree on that point. But, isn't this display scale also stored in
the 'datum image' [1] and could therefore be distinguished against for
two opclass-equal values whilst deduplicating, only putting binary
equal values in a posting list? e.g. all values generated through
'0.000'::numeric -values can be compressed into one posting tuple
without information loss, as the datum image is effectively the same
for all of these (unless I have missed something), and distinct from
the datum image of '0'::numeric. With enough equal values, you would
eventually have a posting lists for each distinct datum image of equal
values.

Given that the above could work, the current btree tuple ordering is
not optimized for opclass-equal but datum image-distinct values:
ordering of opclass-equal values is currently determined only by tid,
with as an example current ordering ['0.0', '0', '0.00', '0', '0.0',
'0']. It would be more optimized for deduplication if that was stored
as e.g. ['0', '0', '0', '0.0', '0.0', '0.00'], which is why I
suggested to add an ordering by the datum image before the tid
ordering. Additionally, this extra ordering also prevents the problem
of [0] by never attempting an insertion of non-equal image datums in a
posting list of otherwise equal values, as it would be ordered either
before or after the posting list, never inside the list.

> If we wanted to fix this for numeric, we'd have to invent a new
> numeric datatype (called numeric2, say). That probably isn't as hard
> as it sounds, since it could be part of the same B-Tree operator
> family as numeric. It could also be implicitly cast to numeric.
>
> --
> Peter Geoghegan

[0]
Inserting a row in a deduplicated index with in, with TID ntid, can
encounter a posting list of a opclass-equal but not datum image-equal
tuples where the lowest TID of the posting list is less than ntid, and
ntid is less than the highest TID of the posting list. This would
require a posting list split to accomodate the new tuples' index entry
in order to not lose data.

[1]
# a table with 1000 distinct values, each duplicated 700 times, but
split into 7x100 'datum-equal' value groups that should be
deduplicatable.
SELECT  (n % 1000 || '.' || repeat('0', n%7))::numeric AS num INTO
numbers FROM generate_series(1, 700000) series(n);
CREATE INDEX numeric_index ON numbers (num);
ANALYZE numbers;
SELECT num FROM numbers ORDER BY num LIMIT 700; -- this is an
index-only scan, and results show numeric scale, so the information
must be stored in the index.



pgsql-general by date:

Previous
From: Scottix
Date:
Subject: Re: "Go" (lang) standard driver
Next
From: Peter Geoghegan
Date:
Subject: Re: Index tuple deduplication limitations in pg13