Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: [PoC] Improve dead tuple storage for lazy vacuum
Date
Msg-id CAD21AoDqO7v5cvFUyWD7_pwceJ1XfhA81FVLTspKn0sFxtqNPA@mail.gmail.com
Whole thread Raw
In response to Re: [PoC] Improve dead tuple storage for lazy vacuum  (John Naylor <john.naylor@enterprisedb.com>)
Responses Re: [PoC] Improve dead tuple storage for lazy vacuum
List pgsql-hackers
On Sat, Feb 11, 2023 at 2:33 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> I didn't get any closer to radix-tree regression,

Me neither. It seems that in v26, inserting chunks into node-32 is
slow but needs more analysis. I'll share if I found something
interesting.

> but I did find some inefficiencies in tidstore_add_tids() that are worth talking about first, addressed in a rough
fashionin the attached .txt addendums that I can clean up and incorporate later. 
>
> To start, I can reproduce the regression with this test as well:
>
> select * from bench_tidstore_load(0, 10 * 1000 * 1000);
>
> v15 + v26 store + adjustments:
>  mem_allocated | load_ms
> ---------------+---------
>       98202152 |    1676
>
> v26 0001-0008
>  mem_allocated | load_ms
> ---------------+---------
>       98202032 |    1826
>
> ...and reverting to the alternate way to update the parent didn't help:
>
> v26 0001-6, 0008, insert_inner w/ null parent
>
>  mem_allocated | load_ms
> ---------------+---------
>       98202032 |    1825
>
> ...and I'm kind of glad that wasn't the problem, because going back to that would be a pain for the shmem case.
>
> Running perf doesn't show anything much different in the proportions (note that rt_set must have been inlined when
declaredlocally in v26): 
>
> v15 + v26 store + adjustments:
>   65.88%  postgres  postgres             [.] tidstore_add_tids
>   10.74%  postgres  postgres             [.] rt_set
>    9.20%  postgres  postgres             [.] palloc0
>    6.49%  postgres  postgres             [.] rt_node_insert_leaf
>
> v26 0001-0008
>   78.50%  postgres  postgres             [.] tidstore_add_tids
>    8.88%  postgres  postgres             [.] palloc0
>    6.24%  postgres  postgres             [.] local_rt_node_insert_leaf
>
> v2699-0001: The first thing I noticed is that palloc0 is taking way more time than it should, and it's because the
compilerdoesn't know the values[] array is small. One reason we need to zero the array is to make the algorithm
agnosticabout what order the offsets come in, as I requested in a previous review. Thinking some more, I was way too
paranoidabout that. As long as access methods scan the line pointer array in the usual way, maybe we can just assert
thatthe keys we create are in order, and zero any unused array entries as we find them. (I admit I can't actually think
ofa reason we would ever encounter offsets out of order.) 

I can think that something like traversing a HOT chain could visit
offsets out of order. But fortunately we prune such collected TIDs
before heap vacuum in heap case.

> Also, we can keep track of the last key we need to consider for insertion into the radix tree, and ignore the rest.
Thatmight shave a few cycles during the exclusive lock when the max offset of an LP_DEAD item < 64 on a given page,
whichI think would be common in the wild. I also got rid of the special case for non-encoding, since shifting by zero
shouldwork the same way. These together led to a nice speedup on the v26 branch: 
>
>  mem_allocated | load_ms
> ---------------+---------
>       98202032 |    1386
>
> v2699-0002: The next thing I noticed is forming a full ItemIdPointer to pass to tid_to_key_off(). That's bad for
tidstore_add_tids()because ItemPointerSetBlockNumber() must do this in order to allow the struct to be SHORTALIGN'd: 
>
> static inline void
> BlockIdSet(BlockIdData *blockId, BlockNumber blockNumber)
> {
> blockId->bi_hi = blockNumber >> 16;
> blockId->bi_lo = blockNumber & 0xffff;
> }
>
> Then, tid_to_key_off() calls ItemPointerGetBlockNumber(), which must reverse the above process:
>
> static inline BlockNumber
> BlockIdGetBlockNumber(const BlockIdData *blockId)
> {
> return (((BlockNumber) blockId->bi_hi) << 16) | ((BlockNumber) blockId->bi_lo);
> }
>
> There is no reason to do any of this if we're not reading/writing directly to/from an on-disk tid etc. To avoid this,
Icreated a new function encode_key_off() [name could be better], which deals with the raw block number that we already
have.Then turn tid_to_key_off() into a wrapper around that, since we still need the full conversion for
tidstore_lookup_tid().
>
> v2699-0003: Get rid of all the remaining special cases for encoding/or not. I am unaware of the need to optimize that
caseor treat it in any way differently. I haven't tested this on an installation with non-default blocksize and didn't
measurethis separately, but 0002+0003 gives: 
>
>  mem_allocated | load_ms
> ---------------+---------
>       98202032 |    1259
>
> If these are acceptable, I can incorporate them into a later patchset.

These are nice improvements! I agree with all changes.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: Kyotaro Horiguchi
Date:
Subject: Re: Reconcile stats in find_tabstat_entry() and get rid of PgStat_BackendFunctionEntry
Next
From: Amit Langote
Date:
Subject: Re: ExecRTCheckPerms() and many prunable partitions (sqlsmith)