Re: [HACKERS] Block level parallel vacuum - Mailing list pgsql-hackers

From Dilip Kumar
Subject Re: [HACKERS] Block level parallel vacuum
Date
Msg-id CAFiTN-s4xx38s0AS-BrNaFDcvpzmYbSgc2YeDVtaQF2CvVkk6w@mail.gmail.com
In response to Re: [HACKERS] Block level parallel vacuum  (Masahiko Sawada <masahiko.sawada@2ndquadrant.com>)
Responses Re: [HACKERS] Block level parallel vacuum  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Thu, 21 Nov 2019, 14:15 Masahiko Sawada, <masahiko.sawada@2ndquadrant.com> wrote:
On Thu, 21 Nov 2019 at 14:32, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Nov 21, 2019 at 10:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Nov 20, 2019 at 11:01 AM Masahiko Sawada
> > <masahiko.sawada@2ndquadrant.com> wrote:
> > >
> > > On Mon, 18 Nov 2019 at 15:38, Masahiko Sawada
> > > <masahiko.sawada@2ndquadrant.com> wrote:
> > > >
> > > > On Mon, 18 Nov 2019 at 15:34, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > On Mon, Nov 18, 2019 at 11:37 AM Masahiko Sawada
> > > > > <masahiko.sawada@2ndquadrant.com> wrote:
> > > > > >
> > > > > > On Wed, 13 Nov 2019 at 14:31, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > > >
> > > > > > >
> > > > > > > Based on these needs, we came up with a way to allow this
> > > > > > > information to be specified for index AMs. Basically, the index AM
> > > > > > > will expose a variable amparallelvacuumoptions which can have the
> > > > > > > options below
> > > > > > >
> > > > > > > VACUUM_OPTION_NO_PARALLEL   1 << 0 # vacuum (neither bulkdelete nor
> > > > > > > vacuumcleanup) can't be performed in parallel
> > > > > >
> > > > > > I think VACUUM_OPTION_NO_PARALLEL can be 0 so that index AMs who don't
> > > > > > want to support parallel vacuum don't have to set anything.
> > > > > >
> > > > >
> > > > > Makes sense.
> > > > >
> > > > > > > VACUUM_OPTION_PARALLEL_BULKDEL   1 << 1 # bulkdelete can be done in
> > > > > > > parallel (Indexes nbtree, hash, gin, gist, spgist, bloom will set this
> > > > > > > flag)
> > > > > > > VACUUM_OPTION_PARALLEL_COND_CLEANUP  1 << 2 # vacuumcleanup can be
> > > > > > > done in parallel if bulkdelete is not performed (Indexes nbtree, brin,
> > > > > > > gin, gist,
> > > > > > > spgist, bloom will set this flag)
> > > > > > > VACUUM_OPTION_PARALLEL_CLEANUP  1 << 3 # vacuumcleanup can be done in
> > > > > > > parallel even if bulkdelete is already performed (Indexes gin, brin,
> > > > > > > and bloom will set this flag)
> > > > > >
> > > > > > I think gin and bloom don't need to set both but should set only
> > > > > > VACUUM_OPTION_PARALLEL_CLEANUP.
> > > > > >
> > > > > > And I'm going to disallow index AMs from setting both
> > > > > > VACUUM_OPTION_PARALLEL_COND_CLEANUP and VACUUM_OPTION_PARALLEL_CLEANUP
> > > > > > with assertions; is that okay?
> > > > > >
> > > > >
> > > > > Sounds reasonable to me.
> > > > >
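(To make the above concrete, here is a rough sketch of how these option
bits could end up looking in code; the values follow the proposal above,
with VACUUM_OPTION_NO_PARALLEL as 0 per the suggestion, and the handler
snippet is only an assumption about how an AM might set the field, not
the actual patch.)

/*
 * Sketch only (not the actual patch): possible declarations for the
 * proposed options, with VACUUM_OPTION_NO_PARALLEL as 0 per the
 * suggestion above.
 */
#define VACUUM_OPTION_NO_PARALLEL           0
#define VACUUM_OPTION_PARALLEL_BULKDEL      (1 << 1)
#define VACUUM_OPTION_PARALLEL_COND_CLEANUP (1 << 2)
#define VACUUM_OPTION_PARALLEL_CLEANUP      (1 << 3)

/* e.g. in btree's handler: bulkdelete in parallel, and cleanup in
 * parallel only when bulkdelete has not been performed */
amroutine->amparallelvacuumoptions =
    VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_COND_CLEANUP;
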
> > > > > Are you planning to include the changes related to I/O throttling
> > > > > based on the discussion in the nearby thread [1]?  I think you can do
> > > > > that if you agree with the conclusion in the last email [1]; otherwise,
> > > > > we can explore it separately.
> > > >
> > > > Yes, I agreed. I'm going to include those changes in the next version
> > > > of the patches. And I think we will be able to have more discussion
> > > > based on the patch.
> > > >
> > >
> > > I've attached the latest version patch set. The patch set includes all
> > > discussed points regarding index AM options as well as shared cost
> > > balance. Also I added some test cases that use all types of index AM.
> > >
> > > During development I had one concern about the number of parallel
> > > workers to launch. In the current design each index AM can choose
> > > whether to participate in parallel bulk-deletion and parallel cleanup.
> > > That also means the number of parallel workers to launch might differ
> > > between parallel bulk-deletion and parallel cleanup. In the current
> > > patch the leader always launches as many workers as there are indexes
> > > that support either one, but that would not be efficient in some cases.
> > > For example, if we have 3 indexes supporting only parallel bulk-deletion
> > > and 2 indexes supporting only parallel index cleanup, we would launch
> > > 5 workers for each execution but some workers would do nothing at all.
> > > To deal with this problem, I wonder if we can improve the parallel-query
> > > infrastructure so that the leader process creates a parallel context
> > > sized for the maximum number of workers but can launch only a subset of
> > > them for each phase, as sketched below.
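(To illustrate the concern, here is a rough sketch of how the worker
count could be computed per phase rather than once for both; the variable
names loosely follow the vacuum code, but the snippet is illustrative
only, not part of the patch.)

/*
 * Illustrative sketch only: count the indexes that can participate in
 * each phase and size the launch accordingly, instead of using a single
 * number for both bulk-deletion and cleanup.
 */
int nworkers_bulkdel = 0;
int nworkers_cleanup = 0;

for (int i = 0; i < nindexes; i++)
{
    uint8 options = Irel[i]->rd_indam->amparallelvacuumoptions;

    if (options & VACUUM_OPTION_PARALLEL_BULKDEL)
        nworkers_bulkdel++;
    if (options & (VACUUM_OPTION_PARALLEL_COND_CLEANUP |
                   VACUUM_OPTION_PARALLEL_CLEANUP))
        nworkers_cleanup++;
}

/*
 * The parallel context would be created for
 * Max(nworkers_bulkdel, nworkers_cleanup) workers, but each phase could
 * then launch only the number it actually needs.
 */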
> > >
> > +
> > +    /* compute new balance by adding the local value */
> > +    shared_balance = pg_atomic_read_u32(VacuumSharedCostBalance);
> > +    new_balance = shared_balance + VacuumCostBalance;
> >
> > +    /* also compute the total local balance */
> > +    local_balance = VacuumCostBalanceLocal + VacuumCostBalance;
> > +
> > +    if ((new_balance >= VacuumCostLimit) &&
> > +        (local_balance > 0.5 * (VacuumCostLimit / nworkers)))
> > +    {
> > +        /* compute sleep time based on the local cost balance */
> > +        msec = VacuumCostDelay * VacuumCostBalanceLocal / VacuumCostLimit;
> > +        new_balance = shared_balance - VacuumCostBalanceLocal;
> > +        VacuumCostBalanceLocal = 0;
> > +    }
> > +
> > +    if (pg_atomic_compare_exchange_u32(VacuumSharedCostBalance,
> > +                                       &shared_balance,
> > +                                       new_balance))
> > +    {
> > +        /* Updated successfully, break */
> > +        break;
> > +    }
> > While looking at the shared costing delay part, I noticed that while
> > checking the delay condition we consider local_balance, which is
> > VacuumCostBalanceLocal + VacuumCostBalance, but while computing the new
> > balance we reduce the shared balance only by VacuumCostBalanceLocal.
> > I think it should be reduced by local_balance?  I see that later we add
> > VacuumCostBalance to VacuumCostBalanceLocal, so we are not losing
> > accounting for this balance.  But I feel it is not right that we compare
> > based on one value and operate based on another.  I think we can set
> > VacuumCostBalanceLocal += VacuumCostBalance immediately before checking
> > the condition.
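(For clarity, a sketch of the rearrangement being suggested: fold the
current cycle's cost into the local balance first, so that the delay
check and the shared-balance adjustment use the same quantity. This is
illustrative only, not a proposed patch hunk.)

/*
 * Sketch of the suggestion: account the current cycle's cost into the
 * local balance up front, then use that same value both for the check
 * and for the shared-balance adjustment.
 */
VacuumCostBalanceLocal += VacuumCostBalance;

shared_balance = pg_atomic_read_u32(VacuumSharedCostBalance);
new_balance = shared_balance + VacuumCostBalance;

if ((new_balance >= VacuumCostLimit) &&
    (VacuumCostBalanceLocal > 0.5 * (VacuumCostLimit / nworkers)))
{
    /* sleep is charged against, and clears, the whole local balance */
    msec = VacuumCostDelay * VacuumCostBalanceLocal / VacuumCostLimit;
    new_balance = shared_balance - VacuumCostBalanceLocal;
    VacuumCostBalanceLocal = 0;
}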
> >
>
> +/*
> + * index_parallelvacuum_estimate - estimate shared memory for parallel vacuum
> + *
> + * Currently, we don't pass any information to the AM-specific estimator,
> + * so it can probably only return a constant.  In the future, we might need
> + * to pass more information.
> + */
> +Size
> +index_parallelvacuum_estimate(Relation indexRelation)
> +{
> +    Size    nbytes;
> +
> +    RELATION_CHECKS;
> +
> +    /*
> +     * If amestimateparallelvacuum is not provided, assume only
> +     * IndexBulkDeleteResult is needed.
> +     */
> +    if (indexRelation->rd_indam->amestimateparallelvacuum != NULL)
> +    {
> +        nbytes = indexRelation->rd_indam->amestimateparallelvacuum();
> +        Assert(nbytes >= MAXALIGN(sizeof(IndexBulkDeleteResult)));
> +    }
> +    else
> +        nbytes = MAXALIGN(sizeof(IndexBulkDeleteResult));
> +
> +    return nbytes;
> +}
>
> In the v33-0001-Add-index-AM-field-and-callback-for-parallel-ind patch, I
> am a bit doubtful about this kind of arrangement, where the code in the
> "if" branch is always unreachable with the current AMs.  I am not sure
> what the best way to handle this is; should we just drop
> amestimateparallelvacuum altogether?

IIUC the motivation for amestimateparallelvacuum is third-party index
AMs. If such an AM allocates more memory than IndexBulkDeleteResult, as
the current gist index does (although we'll change that), it will break
the index statistics of other indexes or could even cause a crash. I'm
not sure whether such third-party index AMs exist, and it's true that
none of the index AMs in the postgres code will use this callback, as
you mentioned, but I think we need to take care of it because such usage
is still possible.
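(For illustration, the kind of third-party AM stats this is guarding
against: a struct larger than IndexBulkDeleteResult, analogous to the
current GistBulkDeleteResult. The struct below is entirely hypothetical.)

/*
 * Hypothetical third-party AM statistics, analogous to the current
 * GistBulkDeleteResult: a struct larger than IndexBulkDeleteResult with
 * the standard result embedded as its first member.  If the shared area
 * reserves only sizeof(IndexBulkDeleteResult) per index, copying this
 * struct there would overrun into the next index's slot.
 */
typedef struct HypotheticalBulkDeleteResult
{
    IndexBulkDeleteResult stats;        /* must come first */
    BlockNumber          *emptyPages;   /* AM-private pointer, extra space */
    int                   nEmptyPages;
} HypotheticalBulkDeleteResult;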

> Because currently we are just providing a size-estimate function without
> a copy function, so even if in the future some AM gives an estimate of
> the size of its stats, we cannot directly memcpy the stats from local
> memory to shared memory; we might then also need a copy function from
> the AM so that it can flatten the stats and store them in the proper
> format?

I might be missing something, but why can't we copy the stats from local
memory to the DSM without a callback for copying stats? The lazy vacuum
code gets the pointer to the stats allocated by the index AM, and the
code knows their size, so I think we can just memcpy them to the DSM.

Oh sure.  But what I meant is that an AM may keep pointers in its stats,
as GistBulkDeleteResult does, so we might not be able to copy them
directly outside the AM.  So I thought that if we have a callback for the
copy, then the AM can flatten the stats such that the IndexBulkDeleteResult
comes first, followed by the AM-specific stats.  But someone may argue
that we could instead force the AM to return the stats in a form that can
be memcpy'd directly.  So I think I am fine with the way it is.
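(A rough sketch of what such "flattened" stats could look like: the
standard result first, then the AM-specific payload inline with no
pointers, so the whole block can be memcpy'd into the area sized by the
estimate callback. This layout is hypothetical.)

/*
 * Hypothetical flattened layout: no pointers, so the whole struct can be
 * memcpy'd into the DSM area whose size amestimateparallelvacuum()
 * reported.
 */
typedef struct HypotheticalFlatStats
{
    IndexBulkDeleteResult stats;        /* standard part, must come first */
    int                   nEmptyPages;  /* AM-specific field, stored inline */
    BlockNumber           emptyPages[FLEXIBLE_ARRAY_MEMBER];
} HypotheticalFlatStats;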
