partial heap only tuples - Mailing list pgsql-hackers
From: Bossart, Nathan
Subject: partial heap only tuples
Msg-id: 2ECBBCA0-4D8D-4841-8872-4A5BBDC063D2@amazon.com
List: pgsql-hackers
Hello, I'm hoping to gather some early feedback on a heap optimization I've been working on. In short, I'm hoping to add "partial heap only tuple" (PHOT) support, which would allow you to skip updating indexes for unchanged columns even when other indexes require updates.

Today, HOT works wonders when no indexed columns are updated. However, as soon as you touch one indexed column, you lose that optimization entirely, as you must update every index on the table. The resulting performance impact is a pain point for many of our (AWS's) enterprise customers, so we'd like to lend a hand for some improvements in this area. For workloads involving many columns and many indexes, an optimization like PHOT can make a huge difference.

I'm aware that there was a previous attempt a few years ago to add a similar optimization called WARM [0] [1]. However, I only noticed this previous effort after coming up with the design for PHOT, so I ended up taking a slightly different approach. I am also aware of a couple of recent nbtree improvements that may mitigate some of the impact of non-HOT updates [2] [3], but I am hoping that PHOT serves as a nice complement to those. I've attached a very early proof-of-concept patch with the design described below.

As far as performance is concerned, it is simple enough to show major benefits from PHOT by tacking a large number of indexes and columns onto a table. For a short pgbench run where each table had 5 additional text columns and indexes on every column, I noticed a ~34% bump in TPS with PHOT [4]. Theoretically, the TPS bump should be even higher with more indexed columns. In addition to showing such benefits, I have also attempted to show that regular pgbench runs are not significantly affected. For a short pgbench run with no table modifications, I noticed a ~2% bump in TPS with PHOT [5].

Next, I'll go into the design a bit.
I've commandeered the two remaining bits in t_infomask2 to use as HEAP_PHOT_UPDATED and HEAP_PHOT_TUPLE. These are analogous to the HEAP_HOT_UPDATED and HEAP_ONLY_TUPLE bits. (If there are concerns about exhausting the t_infomask2 bits, I think we could use just one of the remaining bits as a "modifier" bit on the HOT ones. I opted against that for the proof-of-concept patch to keep things simple.)

When creating a PHOT tuple, we only create new index tuples for updated columns. These new index tuples point to the PHOT tuple. Following is a simple demonstration with a table with two integer columns, each with its own index:

postgres=# SELECT heap_tuple_infomask_flags(t_infomask, t_infomask2), t_data
             FROM heap_page_items(get_raw_page('test', 0))
             WHERE t_infomask IS NOT NULL OR t_infomask2 IS NOT NULL;
                          heap_tuple_infomask_flags                          |       t_data
-----------------------------------------------------------------------------+--------------------
 ("{HEAP_XMIN_COMMITTED,HEAP_XMAX_COMMITTED,HEAP_PHOT_UPDATED}",{})          | \x0000000000000000
 ("{HEAP_XMIN_COMMITTED,HEAP_UPDATED,HEAP_PHOT_UPDATED,HEAP_PHOT_TUPLE}",{}) | \x0100000000000000
 ("{HEAP_XMAX_INVALID,HEAP_UPDATED,HEAP_PHOT_TUPLE}",{})                     | \x0100000002000000
(3 rows)

postgres=# SELECT itemoffset, ctid, data FROM bt_page_items(get_raw_page('test_a_idx', 1));
 itemoffset | ctid  |          data
------------+-------+-------------------------
          1 | (0,1) | 00 00 00 00 00 00 00 00
          2 | (0,2) | 01 00 00 00 00 00 00 00
(2 rows)

postgres=# SELECT itemoffset, ctid, data FROM bt_page_items(get_raw_page('test_b_idx', 1));
 itemoffset | ctid  |          data
------------+-------+-------------------------
          1 | (0,1) | 00 00 00 00 00 00 00 00
          2 | (0,3) | 02 00 00 00 00 00 00 00
(2 rows)

When it is time to scan through a PHOT chain, there are a couple of things to account for. Sequential scans work out of the box thanks to the visibility rules, but other types of scans, such as index scans, require additional checks.
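To make the update-path rule concrete, here is a minimal Python sketch of the decision just described: on UPDATE, new index tuples are created only for indexes that cover a modified column. The helper name and data layout are hypothetical illustrations, not code from the patch:

```python
# Hypothetical model: map each index to the set of columns it covers, then
# compute which indexes need new index tuples for a given PHOT update.
def indexes_needing_new_tuples(indexes, modified_cols):
    """Only indexes whose column sets intersect the modified columns get a
    new index tuple; all other indexes keep pointing into the PHOT chain."""
    return sorted(name for name, cols in indexes.items() if cols & modified_cols)

# Two single-column indexes, as in the demonstration above.
indexes = {"test_a_idx": {"a"}, "test_b_idx": {"b"}}

print(indexes_needing_new_tuples(indexes, {"a"}))   # ['test_a_idx']
print(indexes_needing_new_tuples(indexes, set()))   # [] (a plain HOT update)
```

The contrast with HOT is that any non-empty result at all would today force updates to every index on the table; under PHOT only the intersecting indexes pay that cost.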
If we encounter a PHOT chain when performing an index scan, we should only continue following the chain as long as none of the columns the index covers have been modified. If the scan does encounter such a modification, we stop following the chain and continue with the index scan. Even if there is a tuple later in that PHOT chain that should be returned by our index scan, we will still find it, as there will be another matching index tuple that points to the later part of the chain.

My initial idea for determining which columns were modified was to add a new bitmap after the "nulls" bitmap in the tuple header. However, the attached patch simply uses HeapDetermineModifiedColumns(). I've yet to measure the overhead of this approach versus the bitmap approach, but I haven't noticed anything too detrimental in the testing I've done so far.

In my proof-of-concept patch, I've included a temporary hack to get some basic bitmap scans working as expected. Since we won't have followed the PHOT chains in the bitmap index scan, we must know how to follow them in the bitmap heap scan. Unfortunately, the bitmap heap scan has no knowledge of which indexed columns to pay attention to when following the PHOT chains. My temporary hack fixes this by having the bitmap heap scan pull the set of indexed columns it needs to consider from the outer plan. I think this is one area of the design that will require substantially more effort.
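As a restatement of the index-scan rule above, here is a small Python model of following a PHOT chain: we keep walking only while no chain member has modified a column our index covers. The chain representation is a toy stand-in for the heap, and visibility checks are omitted:

```python
def follow_phot_chain(chain, index_cols):
    """Yield tuple data reachable from an index tuple pointing at chain[0].
    Stop before the first member whose update modified an indexed column;
    that member is reachable via its own, newer index tuple instead."""
    for member in chain:
        if member["modified_cols"] & index_cols:
            break  # a newer index tuple covers the rest of the chain
        yield member["data"]

chain = [
    {"data": (0, 0), "modified_cols": set()},  # root tuple
    {"data": (1, 0), "modified_cols": {"a"}},  # PHOT update touching a
    {"data": (1, 2), "modified_cols": {"b"}},  # PHOT update touching b
]

# Scanning via an index on b: stop as soon as b is modified.
print(list(follow_phot_chain(chain, {"b"})))  # [(0, 0), (1, 0)]
```

The final tuple (1, 2) is not lost to the scan: the index on b holds a second index tuple pointing directly at it, which is exactly why stopping early is safe.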
Following is a demonstration of a basic sequential scan and bitmap scan:

postgres=# EXPLAIN (COSTS FALSE) SELECT * FROM test;
    QUERY PLAN
------------------
 Seq Scan on test
(1 row)

postgres=# SELECT * FROM test;
 a | b
---+---
 1 | 2
(1 row)

postgres=# EXPLAIN (COSTS FALSE) SELECT * FROM test WHERE a >= 0;
              QUERY PLAN
---------------------------------------
 Bitmap Heap Scan on test
   Recheck Cond: (a >= 0)
   ->  Bitmap Index Scan on test_a_idx
         Index Cond: (a >= 0)
(4 rows)

postgres=# SELECT * FROM test WHERE a >= 0;
 a | b
---+---
 1 | 2
(1 row)

This design allows for "weaving" between HOT and PHOT in a chain. However, it is still important to treat each consecutive set of HOT updates or PHOT updates as its own chain for the purposes of pruning and cleanup. Pruning is heavily restricted for PHOT due to the presence of corresponding index tuples. I believe we can redirect line pointers for consecutive sets of PHOT updates that modify the same set of indexed columns, but this is only possible if no index has duplicate values in the redirected set. Also, I do not think it is possible to prune intermediate line pointers in a PHOT chain. While it may be possible to redirect all line pointers to the final tuple in a series of updates to the same set of indexed columns, my hunch is that doing so will add significant complexity for tracking intermediate updates, and any performance gains will be marginal.

I've created some small diagrams to illustrate my proposed cleanup strategy. Here is a hypothetical PHOT chain.

idx1    0       1       2
idx2    0                       1       2
idx3    0
lp      1       2       3       4       5
heap    (0,0,0) (1,0,0) (2,0,0) (2,1,0) (2,2,0)

Heap tuples may be removed and line pointers may be redirected for consecutive updates to the same set of indexes (as long as no index has duplicate values in the redirected set of updates).
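The redirect condition just stated can be sketched as a small Python check. Everything here (can_redirect, the run representation) is a hypothetical model for illustration, not code from the patch:

```python
def can_redirect(run, indexes):
    """run: list of (column_values, modified_cols) for consecutive PHOT
    updates.  Redirecting the run's line pointers to its last tuple is
    allowed only if all updates modified the same set of indexed columns
    and no affected index sees a duplicate key within the run."""
    mods = {frozenset(m) for _, m in run}
    if len(mods) != 1:
        return False      # updates touch different sets of indexed columns
    modified = next(iter(mods))
    for cols in indexes.values():
        if not (cols & modified):
            continue      # index unaffected: one index tuple covers the run
        keys = [tuple(values[c] for c in sorted(cols)) for values, _ in run]
        if len(keys) != len(set(keys)):
            return False  # duplicate key within the run for this index
    return True

idx = {"test_a_idx": {"a"}}
ok  = [({"a": 0}, {"a"}), ({"a": 1}, {"a"}), ({"a": 2}, {"a"})]
dup = [({"a": 0}, {"a"}), ({"a": 1}, {"a"}), ({"a": 0}, {"a"})]
print(can_redirect(ok, idx), can_redirect(dup, idx))  # True False
```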
idx1    0       1       2
idx2    0                       1       2
idx3    0
lp      1       2 -> 3          4 -> 5
heap    (0,0,0)         (2,0,0)         (2,2,0)

When following redirect chains, we must check that the "interesting" columns for the relevant indexes are not updated whenever a tuple is found. In order to be eligible for cleanup, the final tuple in the corresponding PHOT/HOT chain must also be eligible for cleanup, or all indexes must have been updated later in the chain before any visible tuples. (I suspect that the former condition may cause significant bloat for some workloads and that the latter condition may be prohibitively complicated. Perhaps this can be mitigated by limiting how long we allow PHOT chains to get.) My proof-of-concept patch does not yet implement line pointer redirecting and cleanup, so it is possible that I am missing some additional obstacles and optimizations here.

I think PostgreSQL 15 is realistically the earliest target version for this change. Provided that others find this project worthwhile, that's my goal for this patch. I've CC'd a number of folks who have been involved in this project already and who I'm hoping will continue to help me drive this forward.

Nathan

[0] https://www.postgresql.org/message-id/flat/CABOikdMop5Rb_RnS2xFdAXMZGSqcJ-P-BY2ruMd%2BbuUkJ4iDPw%40mail.gmail.com
[1] https://www.postgresql.org/message-id/flat/CABOikdMNy6yowA%2BwTGK9RVd8iw%2BCzqHeQSGpW7Yka_4RSZ_LOQ%40mail.gmail.com
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=0d861bbb
[3] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=d168b666

[4] non-PHOT:

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 256
number of threads: 256
duration: 1800 s
number of transactions actually processed: 29759733
latency average = 15.484 ms
latency stddev = 10.102 ms
tps = 16530.552950 (including connections establishing)
tps = 16530.730565 (excluding connections establishing)

PHOT:

...
number of transactions actually processed: 39998968
latency average = 11.520 ms
latency stddev = 8.157 ms
tps = 22220.709117 (including connections establishing)
tps = 22221.182648 (excluding connections establishing)

[5] non-PHOT:

...
number of transactions actually processed: 151841961
latency average = 3.034 ms
latency stddev = 1.854 ms
tps = 84354.268591 (including connections establishing)
tps = 84355.061353 (excluding connections establishing)

PHOT:

...
number of transactions actually processed: 155225857
latency average = 2.968 ms
latency stddev = 1.264 ms
tps = 86234.044783 (including connections establishing)
tps = 86234.961286 (excluding connections establishing)