Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: [PoC] Improve dead tuple storage for lazy vacuum
Date
Msg-id CAD21AoBPSJ16o9tX9w-iBtxnxVNydm0+4s1ch9jvNbTs=VQ4ag@mail.gmail.com
Whole thread Raw
In response to Re: [PoC] Improve dead tuple storage for lazy vacuum  (John Naylor <johncnaylorls@gmail.com>)
Responses Re: [PoC] Improve dead tuple storage for lazy vacuum
List pgsql-hackers
On Thu, Jan 18, 2024 at 1:30 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Thu, Jan 18, 2024 at 8:31 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > It seems we cannot make this work nicely. IIUC VacDeadItems is
> > allocated in DSM and TidStore is embedded there. However,
> > dead_items->items.area is a local pointer to dsa_area. So we cannot
> > include dsa_area in neither TidStore nor RT_RADIX_TREE. Instead we
> > would need to pass dsa_area to each interface by callers.
>
> Thanks again for exploring this line of thinking! Okay, it seems even
> if there's a way to make this work, it would be too invasive to
> justify when compared with the advantage I was hoping for.
>
> > > If we can't make this work nicely, I'd be okay with keeping the tid
> > > store control object. My biggest concern is unnecessary
> > > double-locking.
> >
> > If we don't do any locking stuff in radix tree APIs and it's the
> > user's responsibility at all, probably we don't need a lock for
> > tidstore? That is, we expose lock functions as you mentioned and the
> > user (like tidstore) acquires/releases the lock before/after accessing
> > the radix tree and num_items.
>
> I'm not quite sure what the point of "num_items" is anymore, because
> it was really tied to the array in VacDeadItems. dead_items->num_items
> is essential to reading/writing the array correctly. If this number is
> wrong, the array is corrupt. There is no such requirement for the
> radix tree. We don't need to know the number of tids to add to it or
> do a lookup, or anything.

True. Sorry I wanted to say "num_tids" of TidStore. I'm still thinking
we need to have the number of TIDs in a tidstore, especially in the
tidstore's control object.

>
> There are a number of places where we assert "the running count of the
> dead items" is the same as "the length of the dead items array", like
> here:
>
> @@ -2214,7 +2205,7 @@ lazy_vacuum(LVRelState *vacrel)
>   BlockNumber threshold;
>
>   Assert(vacrel->num_index_scans == 0);
> - Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
> + Assert(vacrel->lpdead_items == TidStoreNumTids(vacrel->dead_items));
>
> As such, in HEAD I'm guessing it's arbitrary which one is used for
> control flow. Correct me if I'm mistaken. If I am wrong for some part
> of the code, it'd be good to understand when that invariant can't be
> maintained.
>
> @@ -1258,7 +1265,7 @@ lazy_scan_heap(LVRelState *vacrel)
>   * Do index vacuuming (call each index's ambulkdelete routine), then do
>   * related heap vacuuming
>   */
> - if (dead_items->num_items > 0)
> + if (TidStoreNumTids(dead_items) > 0)
>   lazy_vacuum(vacrel);
>
> Like here. In HEAD, could this have used vacrel->dead_items?
>
> @@ -2479,14 +2473,14 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
>   * We set all LP_DEAD items from the first heap pass to LP_UNUSED during
>   * the second heap pass.  No more, no less.
>   */
> - Assert(index > 0);
>   Assert(vacrel->num_index_scans > 1 ||
> -    (index == vacrel->lpdead_items &&
> +    (TidStoreNumTids(vacrel->dead_items) == vacrel->lpdead_items &&
>   vacuumed_pages == vacrel->lpdead_item_pages));
>
>   ereport(DEBUG2,
> - (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
> - vacrel->relname, (long long) index, vacuumed_pages)));
> + (errmsg("table \"%s\": removed " INT64_FORMAT "dead item identifiers
> in %u pages",
> + vacrel->relname, TidStoreNumTids(vacrel->dead_items),
> + vacuumed_pages)));
>
> We assert that vacrel->lpdead_items has the expected value, and then
> the ereport repeats the function call (with a lock) to read the value
> we just consulted to pass the assert.
>
> If we *really* want to compare counts, maybe we could invent a
> debugging-only function that iterates over the tree and popcounts the
> bitmaps. That seems too expensive for regular assert builds, though.

IIUC lpdead_items is the total number of LP_DEAD items vacuumed during
the whole lazy vacuum operation whereas num_items is the number of
LP_DEAD items vacuumed within one index vacuum and heap vacuum cycle.
That is, after heap vacuum, the latter counter is reset while the
former counter is not.

The latter counter is used in lazyvacuum.c as well as the ereport in
vac_bulkdel_one_index().

>
> On the subject of debugging builds, I think it no longer makes sense
> to have the array for debug checking in tid store, even during
> development. A few months ago, we had an encoding scheme that looked
> simple on paper, but its code was fiendishly difficult to follow (at
> least for me). That's gone. In addition to the debugging count above,
> we could also put a copy of the key in the BlockTableEntry's header,
> in debug builds. We don't yet need to care about the key size, since
> we don't (yet) have runtime-embeddable values.

Putting a copy of the key in BlocktableEntry's header is an
interesting idea. But the current debug code in the tidstore also
makes sure that the tidstore returns TIDs in the correct order during
an iterate operation. I think it still has a value and you can disable
it by removing the "#define TIDSTORE_DEBUG" line.

>
> > Currently (as of v52 patch) RT_FIND is
> > doing so,
>
> [meaning, there is no internal "automatic" locking here since after we
> switched to variable-length types, an outstanding TODO]
> Maybe it's okay to expose global locking for v17. I have one possible
> alternative:
>
> This week I tried an idea to use a callback there so that after
> internal unlocking, the caller received the value (or whatever else
> needs to happen, such as lookup an offset in the tid bitmap). I've
> attached a draft for that that passes radix tree tests. It's a bit
> awkward, but I'm guessing this would more closely match future
> internal atomic locking. Let me know what you think of the concept,
> and then do whichever way you think is best. (using v53 as the basis)

Thank you for verifying this idea! Interesting. While it's promising
in terms of future atomic locking, I'm concerned it might not be easy
to use if radix tree APIs supports only such callback style. I believe
the caller would like to pass one more data along with val_data. For
example, considering tidstore that has num_tids internally, it wants
to pass both a pointer to BlocktableEntry and a pointer to TidStore
itself so that it increments the counter while holding a lock.

Another API idea for future atomic locking is to separate
RT_SET()/RT_FIND() into begin and end. In RT_SET_BEGIN() API, we find
the key, extend nodes if necessary, set the value, and return the
result while holding the lock. For example, if the radix tree supports
lock coupling, the leaf node and its parent remain locked. Then the
caller does its job and calls RT_SET_END() that does cleanup stuff
such as releasing locks.

I've not fully considered this approach but even this idea seems
complex and easy to use. I prefer the current simple approach as we
support the simple locking mechanism for now.

>
> I believe this is the only open question remaining. The rest is just
> polish and testing.

Right.

>
> > During trying this idea, I realized that there is a visibility problem
> > in the radix tree template
>
> If it's broken even without the embedding I'll look into this (I don't
> know if this configuration has ever been tested). I think a good test
> is putting the shared tid tree in it's own translation unit, to see if
> anything needs to be fixed. I'll go try that.

Thanks.

BTW in radixtree.h pg_attribute_unused() is used for some functions,
but is it for debugging purposes? I don't see why it's used only for
some functions.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: Bertrand Drouvot
Date:
Subject: Re: System username in pg_stat_activity
Next
From: John Naylor
Date:
Subject: Re: Change GUC hashtable to use simplehash?