Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From John Naylor
Subject Re: [PoC] Improve dead tuple storage for lazy vacuum
Date
Msg-id CAFBsxsEPQkCr50co+EA-+bvQNaVv-Cr99JxT-CMGBBqs0bGftQ@mail.gmail.com
Whole thread Raw
In response to Re: [PoC] Improve dead tuple storage for lazy vacuum  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: [PoC] Improve dead tuple storage for lazy vacuum
List pgsql-hackers
On Thu, Jan 12, 2023 at 9:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Jan 12, 2023 at 5:21 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> > Okay, I'll squash the previous patch and work on cleaning up the internals. I'll keep the external APIs the same so that your work on vacuum integration can be easily rebased on top of that, and we can work independently.

There were some conflicts with HEAD, so to keep the CF bot busy, I've quickly put together v18. I still have a lot of cleanup work to do, but this is enough for now.

0003 contains all v17 local-memory coding squashed together.

0004 perf test not updated but it doesn't build by default so it's fine for now

0005 removes node.chunk as discussed, but does not change node4 fanout yet.

0006 is a small cleanup regarding setting node fanout.

0007 squashes my shared memory work with Masahiko's fixes from the addendum v17-0010.

0008 turns the existence checks in RT_NODE_UPDATE_INNER into Asserts, as discussed.

0009/0010 are just copies of Masauiko's v17 addendum v17-0011/12, but the latter rebased over recent variable renaming (it's possible I missed something, so worth checking).

> I've implemented the idea of using union. Let me share WIP code for
> discussion, I've attached three patches that can be applied on top of

Seems fine as far as the union goes. Let's go ahead with this, and make progress on locking etc.

> Overall, TidStore implementation with the union idea doesn't look so
> ugly to me. But I got many compiler warning about unused radix tree
> functions like:
>
> tidstore.c:99:19: warning: 'shared_rt_delete' defined but not used
> [-Wunused-function]
>
> I'm not sure there is a convenient way to suppress this warning but
> one idea is to have some macros to specify what operations are
> enabled/declared.

That sounds like a good idea. It's also worth wondering if we even need RT_NUM_ENTRIES at all, since the caller is capable of keeping track of that if necessary. It's also misnamed, since it's concerned with the number of keys. The vacuum case cares about the number of TIDs, and not number of (encoded) keys. Even if we ever (say) changed the key to blocknumber and value to Bitmapset, the number of keys might not be interesting. It sounds like we should at least make the delete functionality optional. (Side note on optional functions: if an implementation didn't care about iteration or its order, we could optimize insertion into linear nodes)

Since this is WIP, you may already have some polish in mind, so I won't go over the patches in detail, but I wanted to ask about a few things (numbers referring to v17 addendum, not v18):

0011

+ * 'num_tids' is the number of Tids stored so far. 'max_byte' is the maximum
+ * bytes a TidStore can use. These two fields are commonly used in both
+ * non-shared case and shared case.
+ */
+ uint32 num_tids;

uint32 is how we store the block number, so this too small and will wrap around on overflow. int64 seems better.

+ * We calculate the maximum bytes for the TidStore in different ways
+ * for non-shared case and shared case. Please refer to the comment
+ * TIDSTORE_MEMORY_DEDUCT for details.
+ */

Maybe the #define and comment should be close to here.

+ * Destroy a TidStore, returning all memory. The caller must be certain that
+ * no other backend will attempt to access the TidStore before calling this
+ * function. Other backend must explicitly call tidstore_detach to free up
+ * backend-local memory associated with the TidStore. The backend that calls
+ * tidstore_destroy must not call tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)

If not addressed by next patch, need to phrase comment with FIXME or TODO about making certain.

+ * Add Tids on a block to TidStore. The caller must ensure the offset numbers
+ * in 'offsets' are ordered in ascending order.

Must? What happens otherwise?

+ uint64 last_key = PG_UINT64_MAX;

I'm having some difficulty understanding this sentinel and how it's used.

@@ -1039,11 +1040,18 @@ lazy_scan_heap(LVRelState *vacrel)
  if (prunestate.has_lpdead_items)
  {
  Size freespace;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
 
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ result = tidstore_iterate_next(iter);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+  buf, &vmbuffer);
+ Assert(!tidstore_iterate_next(iter));
+ tidstore_end_iterate(iter);
 
  /* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ tidstore_reset(dead_items);

This part only runs "if (vacrel->nindexes == 0)", so seems like unneeded complexity. It arises because lazy_scan_prune() populates the tid store even if no index vacuuming happens. Perhaps the caller of lazy_scan_prune() could pass the deadoffsets array, and upon returning, either populate the store or call lazy_vacuum_heap_page(), as needed. It's quite possible I'm missing some detail, so some description of the design choices made would be helpful.

On Mon, Jan 16, 2023 at 9:53 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> I've written a simple script to simulate the DSA memory usage and the
> limit. The 75% limit works fine for a power of two cases, and we can
> use the 60% limit for other cases (it seems we can use up to about 66%
> but used 60% for safety). It would be best if we can mathematically
> prove it but I could prove only the power of two cases. But the script
> practically shows the 60% threshold would work for these cases.

Okay. It's worth highlighting this in the comments, and also the fact that it depends on internal details of how DSA increases segment size.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

pgsql-hackers by date:

Previous
From: Peter Smith
Date:
Subject: Re: Perform streaming logical transactions by background workers and parallel apply
Next
From: Dilip Kumar
Date:
Subject: Re: New strategies for freezing, advancing relfrozenxid early