Thread: Index tuple deduplication limitations in pg13
Hi, I was reading through the new features of PG13 (beta), and noticed that deduplication is disabled for float(4, 8) and numeric (and jsonb, ...) due to that the datums of those types could be not binary equal, but equal according for the opclass used. But, if the ordering of operator-class equal tuples is already system-defined, could the physical ordering of index tuples in a btree (with deduplication enabled for "unsafe" opclasses) be updated from [index_columns, tid] to [index_columns, image_compare(non_datum_equal_columns), tid], giving a stable sorting of opclass-equal and image-equal values and enabling safe consistent deduplication? This need not be a default-on feature (opt-in would be fine, as I guess it could have a performance penalty), but having the ability to use btree index tuple deduplication for numeric, float and other data types would count as a win in my book. -Matthias
On Mon, Aug 17, 2020 at 11:44 PM Matthias van de Meent <matthias.vandemeent@cofano.nl> wrote: > But, if the ordering of operator-class equal tuples is already > system-defined, could the physical ordering of index tuples in a btree > (with deduplication enabled for "unsafe" opclasses) be updated from > [index_columns, tid] to [index_columns, > image_compare(non_datum_equal_columns), tid], giving a stable sorting > of opclass-equal and image-equal values and enabling safe consistent > deduplication? The issue isn't the physical ordering. The issue is that we cannot allow the implementation to destroy semantic differences among equal datums. We avoid deduplication with cases where two equal datums are visibly different. For example, it would not be okay if we forgot that your numeric datum was originally input as '5.000', and output '5' later on. If we wanted to fix this for numeric, we'd have to invent a new numeric datatype (called numeric2, say). That probably isn't as hard as it sounds, since it could be part of the same B-Tree operator family as numeric. It could also be implicitly cast to numeric. -- Peter Geoghegan
On Tue, Aug 18, 2020 at 9:44 AM Peter Geoghegan <pg@bowt.ie> wrote: > If we wanted to fix this for numeric, we'd have to invent a new > numeric datatype (called numeric2, say). That probably isn't as hard > as it sounds, since it could be part of the same B-Tree operator > family as numeric. It could also be implicitly cast to numeric. I forgot to say: numeric2 would be just like numeric, except in one specific way: it wouldn't care about display scale. The user would be giving up on display scale by choosing numeric2 over numeric. The "5 vs 5.000" information would always be lost by design, so there'd be nothing for deduplication to break. Deduplication could then be enabled. -- Peter Geoghegan
On Tue, 18 Aug 2020 at 18:44, Peter Geoghegan <pg@bowt.ie> wrote: > > On Mon, Aug 17, 2020 at 11:44 PM Matthias van de Meent > <matthias.vandemeent@cofano.nl> wrote: > > But, if the ordering of operator-class equal tuples is already > > system-defined, could the physical ordering of index tuples in a btree > > (with deduplication enabled for "unsafe" opclasses) be updated from > > [index_columns, tid] to [index_columns, > > image_compare(non_datum_equal_columns), tid], giving a stable sorting > > of opclass-equal and image-equal values and enabling safe consistent > > deduplication? > > The issue isn't the physical ordering. The issue is that we cannot > allow the implementation to destroy semantic differences among equal > datums. Deduplication does not need to destroy semantic differences? 'equal' can (in my book) mean: - 'opclass-equal', that is the opclass returns true for an equality check - 'binary equal' or 'datum-equal' (? maybe incorrect term), that is the unprocessed on-disk representations (datum image is the right term I believe?) of the compared values are indistinguishable. Runs of 'binary equal' datums can be freely deduplicated [0] when found. > We avoid deduplication with cases where two equal datums are > visibly different. For example, it would not be okay if we forgot that > your numeric datum was originally input as '5.000', and output '5' > later on. I agree on that point. But, isn't this display scale also stored in the 'datum image' [1] and could therefore be distinguished against for two opclass-equal values whilst deduplicating, only putting binary equal values in a posting list? e.g. all values generated through '0.000'::numeric -values can be compressed into one posting tuple without information loss, as the datum image is effectively the same for all of these (unless I have missed something), and distinct from the datum image of '0'::numeric. With enough equal values, you would eventually have a posting lists for each distinct datum image of equal values. Given that the above could work, the current btree tuple ordering is not optimized for opclass-equal but datum image-distinct values: ordering of opclass-equal values is currently determined only by tid, with as an example current ordering ['0.0', '0', '0.00', '0', '0.0', '0']. It would be more optimized for deduplication if that was stored as e.g. ['0', '0', '0', '0.0', '0.0', '0.00'], which is why I suggested to add an ordering by the datum image before the tid ordering. Additionally, this extra ordering also prevents the problem of [0] by never attempting an insertion of non-equal image datums in a posting list of otherwise equal values, as it would be ordered either before or after the posting list, never inside the list. > If we wanted to fix this for numeric, we'd have to invent a new > numeric datatype (called numeric2, say). That probably isn't as hard > as it sounds, since it could be part of the same B-Tree operator > family as numeric. It could also be implicitly cast to numeric. > > -- > Peter Geoghegan [0] Inserting a row in a deduplicated index with in, with TID ntid, can encounter a posting list of a opclass-equal but not datum image-equal tuples where the lowest TID of the posting list is less than ntid, and ntid is less than the highest TID of the posting list. This would require a posting list split to accomodate the new tuples' index entry in order to not lose data. [1] # a table with 1000 distinct values, each duplicated 700 times, but split into 7x100 'datum-equal' value groups that should be deduplicatable. SELECT (n % 1000 || '.' || repeat('0', n%7))::numeric AS num INTO numbers FROM generate_series(1, 700000) series(n); CREATE INDEX numeric_index ON numbers (num); ANALYZE numbers; SELECT num FROM numbers ORDER BY num LIMIT 700; -- this is an index-only scan, and results show numeric scale, so the information must be stored in the index.
On Tue, Aug 18, 2020 at 11:52 AM Matthias van de Meent <matthias.vandemeent@cofano.nl> wrote: > Deduplication does not need to destroy semantic differences? 'equal' > can (in my book) mean: > - 'opclass-equal', that is the opclass returns true for an equality check > - 'binary equal' or 'datum-equal' (? maybe incorrect term), that is > the unprocessed on-disk representations (datum image is the right term > I believe?) of the compared values are indistinguishable. > > Runs of 'binary equal' datums can be freely deduplicated [0] when found. > [0] > Inserting a row in a deduplicated index with in, with TID ntid, can > encounter a posting list of a opclass-equal but not datum image-equal > tuples where the lowest TID of the posting list is less than ntid, and > ntid is less than the highest TID of the posting list. This would > require a posting list split to accomodate the new tuples' index entry > in order to not lose data. But you can't do that easily, because it breaks subtle assumptions about posting list splits and space utilization. In particular, it means that you can no longer think of a posting list split as rewriting an incoming new item such that you can more or less pretend that there was no overlap in the first place -- code like _bt_split and nbtsplitloc.c relies on this. Introducing new special cases to nbtsplitloc.c is very unappealing. More concretely, if you introduce a posting list split like this then you need three copies of the key -- the original, the new, and a second copy of the original. That's much more complicated. -- Peter Geoghegan
On Tue, Aug 18, 2020 at 11:52 AM Matthias van de Meent <matthias.vandemeent@cofano.nl> wrote: > Given that the above could work, the current btree tuple ordering is > not optimized for opclass-equal but datum image-distinct values: > ordering of opclass-equal values is currently determined only by tid, > with as an example current ordering ['0.0', '0', '0.00', '0', '0.0', > '0']. It would be more optimized for deduplication if that was stored > as e.g. ['0', '0', '0', '0.0', '0.0', '0.00'], which is why I > suggested to add an ordering by the datum image before the tid > ordering. Additionally, this extra ordering also prevents the problem > of [0] by never attempting an insertion of non-equal image datums in a > posting list of otherwise equal values, as it would be ordered either > before or after the posting list, never inside the list. Yeah, that would work, but at the cost of making numeric totally unusable. Now you cannot rely on unique enforcement detecting that '0' is a duplicate of '0.0'. In fact, even the most trivial use of the = operator will be broken in the presence of different display scales. It's a non-starter. The numeric2 design that I sketched is a bit ugly, but I can see no better way. A three-way posting list split (i.e. the other design that you sketched) is a special case that is very hard to test, very complicated, and of little value in the grand scheme of things. -- Peter Geoghegan
On Tue, 18 Aug 2020 at 22:00, Peter Geoghegan <pg@bowt.ie> wrote: > > On Tue, Aug 18, 2020 at 11:52 AM Matthias van de Meent > <matthias.vandemeent@cofano.nl> wrote: > > Given that the above could work, the current btree tuple ordering is > > not optimized for opclass-equal but datum image-distinct values: > > ordering of opclass-equal values is currently determined only by tid, > > with as an example current ordering ['0.0', '0', '0.00', '0', '0.0', > > '0']. It would be more optimized for deduplication if that was stored > > as e.g. ['0', '0', '0', '0.0', '0.0', '0.00'], which is why I > > suggested to add an ordering by the datum image before the tid > > ordering. Additionally, this extra ordering also prevents the problem > > of [0] by never attempting an insertion of non-equal image datums in a > > posting list of otherwise equal values, as it would be ordered either > > before or after the posting list, never inside the list. > > Yeah, that would work, but at the cost of making numeric totally > unusable. Now you cannot rely on unique enforcement detecting that '0' > is a duplicate of '0.0'. In fact, even the most trivial use of the = > operator will be broken in the presence of different display scales. > It's a non-starter. Would this extra ordering not effectively be an extra tiebreaker in the ordering, applied before the TID? I do not know the full implications of that, but I believe that would not result in the limitations that you are mentioning. - Matthias
On Tue, Aug 18, 2020 at 1:31 PM Matthias van de Meent <matthias.vandemeent@cofano.nl> wrote: > Would this extra ordering not effectively be an extra tiebreaker in > the ordering, applied before the TID? I do not know the full > implications of that, but I believe that would not result in the > limitations that you are mentioning. You could probably do it that way, but again you end up with a lot of new complexity. Not to mention overhead that would have to be paid by everyone. It would require code that supported the old way (even if it was added to Postgres 13) for pg_upgrade, that would also be hard to test. And it might defeat certain future optimizations based on heap TID being the only tiebreaker. Having two types of equality might have to bleed into the optimizer. It's a question of engineering trade-offs. I don't think that it's worth it. -- Peter Geoghegan