Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From John Naylor
Subject Re: [PoC] Improve dead tuple storage for lazy vacuum
Date
Msg-id CAFBsxsEdQPnvgyJRpVS+cWxQpe-Uq23ou0G0ukfK9-_pCxszZA@mail.gmail.com
Whole thread Raw
In response to Re: [PoC] Improve dead tuple storage for lazy vacuum  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: [PoC] Improve dead tuple storage for lazy vacuum  (John Naylor <john.naylor@enterprisedb.com>)
Re: [PoC] Improve dead tuple storage for lazy vacuum  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers
On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Jan 16, 2023 at 2:02 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:

In v21, all of your v20 improvements to the radix tree template and test have been squashed into 0003, with one exception: v20-0010 (recursive freeing of shared mem), which I've attached separately (for flexibility) as v21-0006. I believe one of your earlier patches had a new DSA function for freeing memory more quickly -- was there a problem with that approach? I don't recall where that discussion went.

> + * XXX: Most functions in this file have two variants for inner nodes and leaf
> + * nodes, therefore there are duplication codes. While this sometimes makes the
> + * code maintenance tricky, this reduces branch prediction misses when judging
> + * whether the node is a inner node of a leaf node.
>
> This comment seems to be out-of-date since we made it a template.

Done in 0020, along with a bunch of other comment editing.

> The following macros are defined but not undefined in radixtree.h:

Fixed in v21-0018.

Also:

0007 makes the value type configurable. Some debug functionality still assumes integer type, but I think the rest is agnostic.
0010 turns node4 into node3, as discussed, going from 48 bytes to 32.
0012 adopts the benchmark module to the template, and adds meson support (builds with warnings, but okay because not meant for commit).

The rest are cleanups, small refactorings, and more comment rewrites. I've kept them separate for visibility. Next patch can squash them unless there is any discussion.

> > uint32 is how we store the block number, so this too small and will wrap around on overflow. int64 seems better.
>
> Agreed, will fix.

Great, but it's now uint64, not int64. All the large counters in struct LVRelState, for example, are signed integers, as the usual practice. Unsigned ints are "usually" for things like bit patterns and where explicit wraparound is desired. There's probably more that can be done here to change to signed types, but I think it's still a bit early to get to that level of nitpicking. (Soon, I hope :-) )

> > + * We calculate the maximum bytes for the TidStore in different ways
> > + * for non-shared case and shared case. Please refer to the comment
> > + * TIDSTORE_MEMORY_DEDUCT for details.
> > + */
> >
> > Maybe the #define and comment should be close to here.
>
> Will fix.

For this, I intended that "here" meant "in or just above the function".

+#define TIDSTORE_LOCAL_MAX_MEMORY_DEDUCT (1024L * 70) /* 70kB */
+#define TIDSTORE_SHARED_MAX_MEMORY_RATIO_PO2 (float) 0.75
+#define TIDSTORE_SHARED_MAX_MEMORY_RATIO (float) 0.6

These symbols are used only once, in tidstore_create(), and are difficult to read. That function has few comments. The symbols have several paragraphs, but they are far away. It might be better for readability to just hard-code numbers in the function, with the explanation about the numbers near where they are used.

> > + * Destroy a TidStore, returning all memory. The caller must be certain that
> > + * no other backend will attempt to access the TidStore before calling this
> > + * function. Other backend must explicitly call tidstore_detach to free up
> > + * backend-local memory associated with the TidStore. The backend that calls
> > + * tidstore_destroy must not call tidstore_detach.
> > + */
> > +void
> > +tidstore_destroy(TidStore *ts)
> >
> > If not addressed by next patch, need to phrase comment with FIXME or TODO about making certain.
>
> Will fix.

Did anything change here? There is also this, in the template, which I'm not sure has been addressed:

 * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
 * has the local pointers to nodes, rather than RT_PTR_ALLOC.
 * We need either a safeguard to disallow other processes to begin the iteration
 * while one process is doing or to allow multiple processes to do the iteration.

> > This part only runs "if (vacrel->nindexes == 0)", so seems like unneeded complexity. It arises because lazy_scan_prune() populates the tid store even if no index vacuuming happens. Perhaps the caller of lazy_scan_prune() could pass the deadoffsets array, and upon returning, either populate the store or call lazy_vacuum_heap_page(), as needed. It's quite possible I'm missing some detail, so some description of the design choices made would be helpful.
>
> I agree that we don't need complexity here. I'll try this idea.

Keeping the offsets array in the prunestate seems to work out well.

Some other quick comments on tid store and vacuum, not comprehensive. Let me know if I've misunderstood something:

TID store:

+ * XXXXXXXX XXXYYYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYuuuu
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit

I was confused for a while, and I realized the bits are in reverse order from how they are usually pictured (high on left, low on the right).

+ * 11 bits enough for the offset number, because MaxHeapTuplesPerPage < 2^11
+ * on all supported block sizes (TIDSTORE_OFFSET_NBITS). We are frugal with

+ * XXX: if we want to support non-heap table AM that want to use the full
+ * range of possible offset numbers, we'll need to reconsider
+ * TIDSTORE_OFFSET_NBITS value.

Would it be worth it (or possible) to calculate constants based on compile-time block size? And/or have a fallback for other table AMs? Since this file is in access/common, the intention is to allow general-purpose, I imagine.

+typedef dsa_pointer tidstore_handle;

It's not clear why we need a typedef here, since here:

+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
...
+ control = handle;

...there is a differently-named dsa_pointer variable that just gets the function parameter.

+/* Return the maximum memory TidStore can use */
+uint64
+tidstore_max_memory(TidStore *ts)

size_t is more suitable for memory.

+ /*
+ * Since the shared radix tree supports concurrent insert,
+ * we don't need to acquire the lock.
+ */

Hmm? IIUC, the caller only acquires the lock after returning from here, to update statistics. Why is it safe to insert with no lock? Am I missing something?

VACUUM integration:

-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+#define PARALLEL_VACUUM_KEY_DSA 2

Seems like unnecessary churn? It is still all about dead items, after all. I understand using "DSA" for the LWLock, since that matches surrounding code.

+#define HAS_LPDEAD_ITEMS(state) (((state).lpdead_items) > 0)

This macro helps the patch readability in some places, but I'm not sure it helps readability of the file as a whole. The following is in the patch and seems perfectly clear without the macro:

- if (lpdead_items > 0)
+ if (prunestate->lpdead_items > 0)

About shared memory: I have some mild reservations about the naming of the "control object", which may be in shared memory. Is that an established term? (If so, disregard the rest): It seems backwards -- the thing in shared memory is the actual tree itself. The thing in backend-local memory has the "handle", and that's how we control the tree. I don't have a better naming scheme, though, and might not be that important. (Added a WIP comment)

Now might be a good time to look at earlier XXX comments and come up with a plan to address them.

That's all I have for now.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

pgsql-hackers by date:

Previous
From: Dilip Kumar
Date:
Subject: Re: New strategies for freezing, advancing relfrozenxid early
Next
From: Daniel Gustafsson
Date:
Subject: Re: TAP output format in pg_regress