Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers
From | John Naylor |
---|---|
Subject | Re: [PoC] Improve dead tuple storage for lazy vacuum |
Date | |
Msg-id | CANWCAZbycz6MRWUmYA8GhubLNqVoby779=_iL28mRxKApMazgw@mail.gmail.com Whole thread Raw |
In response to | Re: [PoC] Improve dead tuple storage for lazy vacuum (Masahiko Sawada <sawada.mshk@gmail.com>) |
Responses |
Re: [PoC] Improve dead tuple storage for lazy vacuum
|
List | pgsql-hackers |
On Fri, Mar 22, 2024 at 12:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Mar 21, 2024 at 7:48 PM John Naylor <johncnaylorls@gmail.com> wrote: > > v77-0001 > > > > - dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items)); > > - dead_items->max_items = max_items; > > - dead_items->num_items = 0; > > + vacrel->dead_items = TidStoreCreate(vac_work_mem, NULL, 0); > > + > > + dead_items_info = (VacDeadItemsInfo *) palloc(sizeof(VacDeadItemsInfo)); > > + dead_items_info->max_bytes = vac_work_mem * 1024L; > > > > This is confusing enough that it looks like a bug: > > > > [inside TidStoreCreate()] > > /* choose the maxBlockSize to be no larger than 1/16 of max_bytes */ > > while (16 * maxBlockSize > max_bytes * 1024L) > > maxBlockSize >>= 1; > > > > This was copied from CreateWorkExprContext, which operates directly on > > work_mem -- if the parameter is actually bytes, we can't "* 1024" > > here. If we're passing something measured in kilobytes, the parameter > > is badly named. Let's use convert once and use bytes everywhere. > > True. The attached 0001 patch fixes it. v78-0001 and 02 are fine, but for 0003 there is a consequence that I didn't see mentioned: vac_work_mem now refers to bytes, where before it referred to kilobytes. It seems pretty confusing to use a different convention from elsewhere, especially if it has the same name but different meaning across versions. Worse, this change is buried inside a moving-stuff-around diff, making it hard to see. Maybe "convert only once" is still possible, but I was actually thinking of + dead_items_info->max_bytes = vac_work_mem * 1024L; + vacrel->dead_items = TidStoreCreate(dead_items_info->max_bytes, NULL, 0); That way it's pretty obvious that it's correct. That may require a bit of duplication and moving around for shmem, but there is some of that already. More on 0003: - * The major space usage for vacuuming is storage for the array of dead TIDs + * The major space usage for vacuuming is TidStore, a storage for dead TIDs + * autovacuum_work_mem) memory space to keep track of dead TIDs. If the + * TidStore is full, we must call lazy_vacuum to vacuum indexes (and to vacuum I wonder if the comments here should refer to it using a more natural spelling, like "TID store". - * items in the dead_items array for later vacuuming, count live and + * items in the dead_items for later vacuuming, count live and Maybe "the dead_items area", or "the dead_items store" or "in dead_items"? - * remaining LP_DEAD line pointers on the page in the dead_items - * array. These dead items include those pruned by lazy_scan_prune() - * as well we line pointers previously marked LP_DEAD. + * remaining LP_DEAD line pointers on the page in the dead_items. + * These dead items include those pruned by lazy_scan_prune() as well + * we line pointers previously marked LP_DEAD. Here maybe "into dead_items". Also, "we line pointers" seems to be a pre-existing typo. - (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages", - vacrel->relname, (long long) index, vacuumed_pages))); + (errmsg("table \"%s\": removed " INT64_FORMAT "dead item identifiers in %u pages", + vacrel->relname, vacrel->dead_items_info->num_items, vacuumed_pages))); This is a translated message, so let's keep the message the same. /* * Allocate dead_items (either using palloc, or in dynamic shared memory). * Sets dead_items in vacrel for caller. * * Also handles parallel initialization as part of allocating dead_items in * DSM when required. */ static void dead_items_alloc(LVRelState *vacrel, int nworkers) This comment didn't change at all. It's not wrong, but let's consider updating the specifics. v78-0004: > > +#define dsa_create(tranch_id) \ > > + dsa_create_ext(tranch_id, DSA_INITIAL_SEGMENT_SIZE, DSA_MAX_SEGMENT_SIZE) > > > > Since these macros are now referring to defaults, maybe their name > > should reflect that. Something like DSA_DEFAULT_INIT_SEGMENT_SIZE > > (*_MAX_*) > > It makes sense to rename DSA_INITIAL_SEGMENT_SIZE , but I think that > the DSA_MAX_SEGMENT_SIZE is the theoretical maximum size, the current > name also makes sense to me. Right, that makes sense. v78-0005: "Although commit XXX allowed specifying the initial and maximum DSA segment sizes, callers still needed to clamp their own limits, which was not consistent and user-friendly." Perhaps s/still needed/would have needed/ ..., since we're preventing that necessity. > > Did you try it with 1MB m_w_m? > > I've incorporated the above comments and test results look good to me. Could you be more specific about what the test was? Does it work with 1MB m_w_m? + /* + * Choose the initial and maximum DSA segment sizes to be no longer + * than 1/16 and 1/8 of max_bytes, respectively. If the initial + * segment size is low, we end up having many segments, which risks + * exceeding the total number of segments the platform can have. The second sentence is technically correct, but I'm not sure how it relates to the code that follows. + while (16 * dsa_init_size > max_bytes) + dsa_init_size >>= 1; + while (8 * dsa_max_size > max_bytes) + dsa_max_size >>= 1; I'm not sure we need a separate loop for "dsa_init_size". Can we just have : while (8 * dsa_max_size > max_bytes) dsa_max_size >>= 1; if (dsa_max_size < DSA_MIN_SEGMENT_SIZE) dsa_max_size = DSA_MIN_SEGMENT_SIZE; if (dsa_init_size > dsa_max_size) dsa_init_size = dsa_max_size; @@ -113,13 +113,10 @@ static void tidstore_iter_extract_tids(TidStoreIter *iter, BlockNumber blkno, * CurrentMemoryContext at the time of this call. The TID storage, backed * by a radix tree, will live in its child memory context, rt_context. The * TidStore will be limited to (approximately) max_bytes total memory - * consumption. If the 'area' is non-NULL, the radix tree is created in the - * DSA area. - * - * The returned object is allocated in backend-local memory. + * consumption. The existing comment slipped past my radar, but max_bytes is not a limit, it's a hint. Come to think of it, it never was a limit in the normal sense, but in earlier patches it was the criteria for reporting "I'm full" when asked. void TidStoreDestroy(TidStore *ts) { - /* Destroy underlying radix tree */ if (TidStoreIsShared(ts)) + { + /* Destroy underlying radix tree */ shared_ts_free(ts->tree.shared); + + dsa_detach(ts->area); + } else local_ts_free(ts->tree.local); It's still destroyed in the local case, so not sure why this comment was moved? v78-0006: -#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2 +/* 2 was PARALLEL_VACUUM_KEY_DEAD_ITEMS */ I don't see any use in core outside this module -- maybe it's possible to renumber these?
pgsql-hackers by date: