
From: Masahiko Sawada
Subject: Re: Block level parallel vacuum WIP
Msg-id: CAD21AoDn6YUya9ar0=s92Li9N=Zmiq+dhWtkD8UuEOV3xLn8gw@mail.gmail.com
In response to: Re: Block level parallel vacuum WIP (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Tue, Aug 23, 2016 at 10:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Aug 23, 2016 at 7:02 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> I'd like to propose block-level parallel VACUUM.
>> This feature allows VACUUM to use multiple CPU cores.
>
> Great.  This is something that I have thought about, too.  Andres and
> Heikki recommended it as a project to me a few PGCons ago.
>
>> As a PoC, I implemented parallel vacuum so that each worker
>> processes both phase 1 and phase 2 for a particular block range.
>> Suppose we vacuum a 1000-block table with 4 workers: each worker
>> processes 250 consecutive blocks in phase 1 and then reclaims dead
>> tuples from the heap and indexes (phase 2).
>> To use the visibility map efficiently, each worker scans a particular
>> block range of the relation and collects dead tuple locations.
>> After all workers have finished their tasks, the leader process
>> gathers the vacuum statistics and updates relfrozenxid if possible.
>
> This doesn't seem like a good design, because it adds a lot of extra
> index scanning work.  What I think you should do is:
>
> 1. Use a parallel heap scan (heap_beginscan_parallel) to let all
> workers scan in parallel.  Allocate a DSM segment to store the control
> structure for this parallel scan plus an array for the dead tuple IDs
> and a lock to protect the array.
>
> 2. When you finish the heap scan, or when the array of dead tuple IDs
> is full (or very nearly full?), perform a cycle of index vacuuming.
> For now, have each worker process a separate index; extra workers just
> wait.  Perhaps use the condition variable patch that I posted
> previously to make the workers wait.  Then resume the parallel heap
> scan, if not yet done.
>
> Later, we can try to see if there's a way to have multiple workers
> work together to vacuum a single index.  But the above seems like a
> good place to start.

Thank you for the advice.
That is what I had been considering as an alternative design; I will
change the patch to follow it.
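
To check my understanding, the shared state in the DSM segment could
look something like this (a rough sketch with provisional names, not
actual code; the segment would also hold the ParallelHeapScanDescData
used by heap_beginscan_parallel()):

#include "postgres.h"
#include "storage/itemptr.h"    /* ItemPointerData */
#include "storage/lwlock.h"     /* LWLock */

/* Shared dead-tuple accumulator, one instance in the DSM segment. */
typedef struct LVDeadTuples
{
    LWLock      mutex;          /* protects the fields below */
    int         num_tuples;     /* entries used in itemptrs[] */
    int         max_tuples;     /* capacity, sized from
                                 * maintenance_work_mem */
    ItemPointerData itemptrs[FLEXIBLE_ARRAY_MEMBER];
} LVDeadTuples;

/*
 * Worker flow: fetch blocks via the shared parallel heap scan and append
 * dead tuple TIDs under the mutex.  When itemptrs[] is nearly full, the
 * workers synchronize; each claims one unprocessed index and vacuums it
 * (extra workers wait, e.g. on a condition variable), and the heap scan
 * then resumes until the relation is finished.
 */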

>> I also changed the buffer lock infrastructure so that multiple
>> processes can wait for a cleanup lock on a buffer.
>
> You won't need this if you proceed as above, which is probably a good thing.

Right.

>
>> And the new GUC parameter vacuum_parallel_workers controls the number
>> of vacuum workers.
>
> I suspect that for autovacuum there is little reason to use parallel
> vacuum, since most of the time we are trying to slow vacuum down, not
> speed it up.  I'd be inclined, for starters, to just add a PARALLEL
> option to the VACUUM command, for when people want to speed up
> parallel vacuums.  Perhaps
>
> VACUUM (PARALLEL 4) relation;
>
> ...could mean to vacuum the relation with the given number of workers, and:
>
> VACUUM (PARALLEL) relation;
>
> ...could mean to vacuum the relation in parallel with the system
> choosing the number of workers - 1 worker per index is probably a good
> starting formula, though it might need some refinement.

That looks convenient.
I was also thinking that we could manage the number of parallel workers
per table for autovacuum with a per-table reloption, like:
ALTER TABLE relation SET (parallel_vacuum_workers = 2);
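
As for the default, the "1 worker per index" starting formula could be
as simple as this (again just a sketch; the function is hypothetical):

#include "postgres.h"
#include "miscadmin.h"          /* max_worker_processes */
#include "nodes/pg_list.h"      /* list_length() */
#include "utils/rel.h"
#include "utils/relcache.h"     /* RelationGetIndexList() */

/* Hypothetical: pick the number of vacuum workers for a relation. */
static int
choose_vacuum_workers(Relation onerel, int requested)
{
    int         nworkers;

    if (requested > 0)
        nworkers = requested;   /* explicit VACUUM (PARALLEL n) */
    else
        nworkers = Max(list_length(RelationGetIndexList(onerel)), 1);

    /* never ask for more workers than the system can start */
    return Min(nworkers, max_worker_processes);
}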

Regards,

--
Masahiko Sawada


