partial heap only tuples - Mailing list pgsql-hackers
From: Bossart, Nathan
Subject: partial heap only tuples
Msg-id: 2ECBBCA0-4D8D-4841-8872-4A5BBDC063D2@amazon.com
List: pgsql-hackers
Hello, I'm hoping to gather some early feedback on a heap optimization I've been working on. In short, I'm hoping to add "partial heap only tuple" (PHOT) support, which would allow you to skip updating indexes for unchanged columns even when other indexes require updates.

Today, HOT works wonders when no indexed columns are updated. However, as soon as you touch one indexed column, you lose that optimization entirely, as you must update every index on the table. The resulting performance impact is a pain point for many of our (AWS's) enterprise customers, so we'd like to lend a hand for some improvements in this area. For workloads involving many columns and many indexes, an optimization like PHOT can make a huge difference.

I'm aware that there was a previous attempt a few years ago to add a similar optimization called WARM [0] [1]. However, I only noticed this previous effort after coming up with the design for PHOT, so I ended up taking a slightly different approach. I am also aware of a couple of recent nbtree improvements that may mitigate some of the impact of non-HOT updates [2] [3], but I am hoping that PHOT serves as a nice complement to those. I've attached a very early proof-of-concept patch with the design described below.

As far as performance is concerned, it is simple enough to show major benefits from PHOT by tacking a large number of indexes and columns onto a table. For a short pgbench run where each table had 5 additional text columns and indexes on every column, I noticed a ~34% bump in TPS with PHOT [4]. Theoretically, the TPS bump should be even higher with more indexed columns. In addition to showing such benefits, I have also attempted to show that regular pgbench runs are not significantly affected. For a short pgbench run with no table modifications, I noticed a ~2% bump in TPS with PHOT [5].

Next, I'll go into the design a bit.
I've commandeered the two remaining bits in t_infomask2 to use as HEAP_PHOT_UPDATED and HEAP_PHOT_TUPLE. These are analogous to the HEAP_HOT_UPDATED and HEAP_ONLY_TUPLE bits. (If there are concerns about exhausting the t_infomask2 bits, I think we could use just one of the remaining bits as a "modifier" bit on the HOT ones. I opted against that for the proof-of-concept patch to keep things simple.)

When creating a PHOT tuple, we only create new index tuples for updated columns. These new index tuples point to the PHOT tuple. Following is a simple demonstration with a table with two integer columns, each with its own index:

postgres=# SELECT heap_tuple_infomask_flags(t_infomask, t_infomask2), t_data
             FROM heap_page_items(get_raw_page('test', 0))
             WHERE t_infomask IS NOT NULL OR t_infomask2 IS NOT NULL;
                          heap_tuple_infomask_flags                          |       t_data
-----------------------------------------------------------------------------+--------------------
 ("{HEAP_XMIN_COMMITTED,HEAP_XMAX_COMMITTED,HEAP_PHOT_UPDATED}",{})          | \x0000000000000000
 ("{HEAP_XMIN_COMMITTED,HEAP_UPDATED,HEAP_PHOT_UPDATED,HEAP_PHOT_TUPLE}",{}) | \x0100000000000000
 ("{HEAP_XMAX_INVALID,HEAP_UPDATED,HEAP_PHOT_TUPLE}",{})                     | \x0100000002000000
(3 rows)

postgres=# SELECT itemoffset, ctid, data FROM bt_page_items(get_raw_page('test_a_idx', 1));
 itemoffset | ctid  |          data
------------+-------+-------------------------
          1 | (0,1) | 00 00 00 00 00 00 00 00
          2 | (0,2) | 01 00 00 00 00 00 00 00
(2 rows)

postgres=# SELECT itemoffset, ctid, data FROM bt_page_items(get_raw_page('test_b_idx', 1));
 itemoffset | ctid  |          data
------------+-------+-------------------------
          1 | (0,1) | 00 00 00 00 00 00 00 00
          2 | (0,3) | 02 00 00 00 00 00 00 00
(2 rows)

When it is time to scan through a PHOT chain, there are a couple of things to account for. Sequential scans work out of the box thanks to the visibility rules, but other types of scans, such as index scans, require additional checks.
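To make the update-path rule concrete, here is a minimal Python sketch of the decision just described: on UPDATE, new index tuples are created only for indexes that cover a modified column. The helper name and data layout are hypothetical illustrations, not code from the patch:

```python
# Hypothetical model: map each index to the set of columns it covers, then
# compute which indexes need new index tuples for a given PHOT update.
def indexes_needing_new_tuples(indexes, modified_cols):
    """Only indexes whose column sets intersect the modified columns get a
    new index tuple; all other indexes keep pointing into the PHOT chain."""
    return sorted(name for name, cols in indexes.items() if cols & modified_cols)

# Two single-column indexes, as in the demonstration above.
indexes = {"test_a_idx": {"a"}, "test_b_idx": {"b"}}

print(indexes_needing_new_tuples(indexes, {"a"}))   # ['test_a_idx']
print(indexes_needing_new_tuples(indexes, set()))   # [] (a plain HOT update)
```

The contrast with HOT is that any non-empty result at all would today force updates to every index on the table; under PHOT only the intersecting indexes pay that cost.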
If we encounter a PHOT chain when performing an index scan, we should only continue following the chain as long as none of the columns the index covers have been modified. If the scan does encounter such a modification, we stop following the chain and continue with the index scan. Even if there is a tuple later in that PHOT chain that should be returned by our index scan, we will still find it, as there will be another matching index tuple that points to the later part of the chain.

My initial idea for determining which columns were modified was to add a new bitmap after the "nulls" bitmap in the tuple header. However, the attached patch simply uses HeapDetermineModifiedColumns(). I've yet to measure the overhead of this approach versus the bitmap approach, but I haven't noticed anything too detrimental in the testing I've done so far.

In my proof-of-concept patch, I've included a temporary hack to get some basic bitmap scans working as expected. Since we won't have followed the PHOT chains in the bitmap index scan, we must know how to follow them in the bitmap heap scan. Unfortunately, the bitmap heap scan has no knowledge of which indexed columns to pay attention to when following the PHOT chains. My temporary hack fixes this by having the bitmap heap scan pull the set of indexed columns it needs to consider from the outer plan. I think this is one area of the design that will require substantially more effort.
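As a restatement of the index-scan rule above, here is a small Python model of following a PHOT chain: we keep walking only while no chain member has modified a column our index covers. The chain representation is a toy stand-in for the heap, and visibility checks are omitted:

```python
def follow_phot_chain(chain, index_cols):
    """Yield tuple data reachable from an index tuple pointing at chain[0].
    Stop before the first member whose update modified an indexed column;
    that member is reachable via its own, newer index tuple instead."""
    for member in chain:
        if member["modified_cols"] & index_cols:
            break  # a newer index tuple covers the rest of the chain
        yield member["data"]

chain = [
    {"data": (0, 0), "modified_cols": set()},  # root tuple
    {"data": (1, 0), "modified_cols": {"a"}},  # PHOT update touching a
    {"data": (1, 2), "modified_cols": {"b"}},  # PHOT update touching b
]

# Scanning via an index on b: stop as soon as b is modified.
print(list(follow_phot_chain(chain, {"b"})))  # [(0, 0), (1, 0)]
```

The final tuple (1, 2) is not lost to the scan: the index on b holds a second index tuple pointing directly at it, which is exactly why stopping early is safe.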
Following is a demonstration of a basic sequential scan and bitmap scan:

postgres=# EXPLAIN (COSTS FALSE) SELECT * FROM test;
    QUERY PLAN
------------------
 Seq Scan on test
(1 row)

postgres=# SELECT * FROM test;
 a | b
---+---
 1 | 2
(1 row)

postgres=# EXPLAIN (COSTS FALSE) SELECT * FROM test WHERE a >= 0;
              QUERY PLAN
---------------------------------------
 Bitmap Heap Scan on test
   Recheck Cond: (a >= 0)
   ->  Bitmap Index Scan on test_a_idx
         Index Cond: (a >= 0)
(4 rows)

postgres=# SELECT * FROM test WHERE a >= 0;
 a | b
---+---
 1 | 2
(1 row)

This design allows for "weaving" between HOT and PHOT in a chain. However, it is still important to treat each consecutive set of HOT updates or PHOT updates as its own chain for the purposes of pruning and cleanup. Pruning is heavily restricted for PHOT due to the presence of corresponding index tuples. I believe we can redirect line pointers for consecutive sets of PHOT updates that modify the same set of indexed columns, but this is only possible if no index has duplicate values in the redirected set. Also, I do not think it is possible to prune intermediate line pointers in a PHOT chain. While it may be possible to redirect all line pointers to the final tuple in a series of updates to the same set of indexed columns, my hunch is that doing so will add significant complexity for tracking intermediate updates, and any performance gains will be marginal.

I've created some small diagrams to illustrate my proposed cleanup strategy. Here is a hypothetical PHOT chain.

idx1    0       1       2
idx2    0                       1       2
idx3    0
lp      1       2       3       4       5
heap    (0,0,0) (1,0,0) (2,0,0) (2,1,0) (2,2,0)

Heap tuples may be removed and line pointers may be redirected for consecutive updates to the same set of indexes (as long as no index has duplicate values in the redirected set of updates).
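The redirect condition just stated can be sketched as a small Python check. Everything here (can_redirect, the run representation) is a hypothetical model for illustration, not code from the patch:

```python
def can_redirect(run, indexes):
    """run: list of (column_values, modified_cols) for consecutive PHOT
    updates.  Redirecting the run's line pointers to its last tuple is
    allowed only if all updates modified the same set of indexed columns
    and no affected index sees a duplicate key within the run."""
    mods = {frozenset(m) for _, m in run}
    if len(mods) != 1:
        return False      # updates touch different sets of indexed columns
    modified = next(iter(mods))
    for cols in indexes.values():
        if not (cols & modified):
            continue      # index unaffected: one index tuple covers the run
        keys = [tuple(values[c] for c in sorted(cols)) for values, _ in run]
        if len(keys) != len(set(keys)):
            return False  # duplicate key within the run for this index
    return True

idx = {"test_a_idx": {"a"}}
ok  = [({"a": 0}, {"a"}), ({"a": 1}, {"a"}), ({"a": 2}, {"a"})]
dup = [({"a": 0}, {"a"}), ({"a": 1}, {"a"}), ({"a": 0}, {"a"})]
print(can_redirect(ok, idx), can_redirect(dup, idx))  # True False
```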
idx1    0       1       2
idx2    0                       1       2
idx3    0
lp      1       2 -> 3          4 -> 5
heap    (0,0,0)         (2,0,0)         (2,2,0)

When following redirect chains, we must check that the "interesting" columns for the relevant indexes are not updated whenever a tuple is found. In order to be eligible for cleanup, the final tuple in the corresponding PHOT/HOT chain must also be eligible for cleanup, or all indexes must have been updated later in the chain before any visible tuples. (I suspect that the former condition may cause significant bloat for some workloads and that the latter condition may be prohibitively complicated. Perhaps this can be mitigated by limiting how long we allow PHOT chains to get.) My proof-of-concept patch does not yet implement line pointer redirecting and cleanup, so it is possible that I am missing some additional obstacles and optimizations here.

I think PostgreSQL 15 is realistically the earliest target version for this change. Provided that others find this project worthwhile, that's my goal for this patch. I've CC'd a number of folks who have been involved in this project already and who I'm hoping will continue to help me drive this forward.

Nathan

[0] https://www.postgresql.org/message-id/flat/CABOikdMop5Rb_RnS2xFdAXMZGSqcJ-P-BY2ruMd%2BbuUkJ4iDPw%40mail.gmail.com
[1] https://www.postgresql.org/message-id/flat/CABOikdMNy6yowA%2BwTGK9RVd8iw%2BCzqHeQSGpW7Yka_4RSZ_LOQ%40mail.gmail.com
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=0d861bbb
[3] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=d168b666

[4] non-PHOT:

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 256
number of threads: 256
duration: 1800 s
number of transactions actually processed: 29759733
latency average = 15.484 ms
latency stddev = 10.102 ms
tps = 16530.552950 (including connections establishing)
tps = 16530.730565 (excluding connections establishing)

PHOT:

...
number of transactions actually processed: 39998968
latency average = 11.520 ms
latency stddev = 8.157 ms
tps = 22220.709117 (including connections establishing)
tps = 22221.182648 (excluding connections establishing)

[5] non-PHOT:

...
number of transactions actually processed: 151841961
latency average = 3.034 ms
latency stddev = 1.854 ms
tps = 84354.268591 (including connections establishing)
tps = 84355.061353 (excluding connections establishing)

PHOT:

...
number of transactions actually processed: 155225857
latency average = 2.968 ms
latency stddev = 1.264 ms
tps = 86234.044783 (including connections establishing)
tps = 86234.961286 (excluding connections establishing)