Re: Recovery performance of standby for multiple concurrent truncates on large tables - Mailing list pgsql-hackers

From | Thomas Munro
---|---
Subject | Re: Recovery performance of standby for multiple concurrent truncates on large tables
Date |
Msg-id | CAEepm=24n+QNzL6hCkgBN9uvCEys2_U9DhSR00uX0XcmQ=C9iw@mail.gmail.com
In response to | Re: Recovery performance of standby for multiple concurrent truncates on large tables (Robert Haas <robertmhaas@gmail.com>)
Responses | RE: Recovery performance of standby for multiple concurrent truncates on large tables
List | pgsql-hackers
On Tue, Jul 31, 2018 at 8:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jul 30, 2018 at 1:22 AM, Jamison, Kirk <k.jamison@jp.fujitsu.com> wrote:
>> 1. Because the multiple scans of the whole shared buffer per concurrent
>> truncate/drop table were the cause of the time-consuming behavior, DURING
>> the failover process while WAL is being applied, we temporarily delay the
>> scanning and invalidating of shared buffers. At the same time, we remember
>> the relations/relfilenodes (of dropped/truncated tables) by adding them to
>> a hash table called "skip list".
>> 2. After WAL is applied, the checkpointer (or bgwriter) scans the shared
>> buffer only ONCE, compares the pages against the skip list, and
>> invalidates the relevant pages. After deleting the relevant pages from
>> shared memory, they will not be written back to disk.
>>
>> Assuming the theory works, this design will only affect the behavior of
>> the checkpointer (or maybe bgwriter) during the recovery process /
>> failover. Any feedback, thoughts?
>
> How would this work if a relfilenode number that belonged to an old
> relation got recycled for a new relation?
>
> I think something like this could be made to work -- both on the
> master and the standby, and not just while waiting for a failover --
> if we did something like this:
>
> (1) Limit the number of deferred drops to a reasonably small number
> (one cache line? 1kB?).
> (2) Background writer regularly does scans to clean out everything
> from the deferred-drop list.
> (3) If a foreground process finds the deferred-drop list is full when
> it needs to add to it, it forces a clean-out of the list contents +
> whatever new stuff it has in a single pass.
> (4) If we are about to generate a relfilenode that's in the list, we
> either force a clean-out or generate a different relfilenode instead.
> (5) Every buffer cleaning operation checks the deferred-drop list and
> invalidates without write-out if the buffer is found.
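[For readers following along: Robert's five steps could be sketched roughly as below. This is a toy illustration, not PostgreSQL code -- every name (DeferredDropList, deferred_drop, clean_out_drop_list, assign_relfilenode, the simplified RelFileNode) is hypothetical, and the real buffer manager involves locking, partitioned buffer tables, and a struct-valued RelFileNode that are all omitted here.]

```c
/* Hypothetical sketch of a deferred-drop list: a tiny fixed-size array of
 * dropped relfilenodes that buffer cleaning consults, with a forced single
 * clean-out pass when the list fills up. */
#include <assert.h>
#include <stdbool.h>

#define MAX_DEFERRED_DROPS 8    /* point (1): keep the list tiny */
#define NBUFFERS 16             /* toy buffer pool */

typedef unsigned RelFileNode;   /* simplified: the real one is a struct */

typedef struct
{
    RelFileNode rnodes[MAX_DEFERRED_DROPS];
    int         ndrops;
} DeferredDropList;

typedef struct
{
    RelFileNode rnode;
    bool        valid;
    bool        dirty;
} Buffer;

static DeferredDropList drop_list;
static Buffer buffers[NBUFFERS];

static bool
in_drop_list(RelFileNode rnode)
{
    for (int i = 0; i < drop_list.ndrops; i++)
        if (drop_list.rnodes[i] == rnode)
            return true;
    return false;
}

/* Points (2)/(3)/(5): one pass over the pool invalidates, without
 * write-out, every buffer belonging to a listed relfilenode, then
 * empties the list. */
static void
clean_out_drop_list(void)
{
    for (int i = 0; i < NBUFFERS; i++)
        if (buffers[i].valid && in_drop_list(buffers[i].rnode))
            buffers[i].valid = false;
    drop_list.ndrops = 0;
}

/* Called at DROP/TRUNCATE time instead of scanning the whole pool. */
static void
deferred_drop(RelFileNode rnode)
{
    if (drop_list.ndrops == MAX_DEFERRED_DROPS)
        clean_out_drop_list();      /* point (3): list full => one pass */
    drop_list.rnodes[drop_list.ndrops++] = rnode;
}

/* Point (4): relfilenode allocation must avoid listed values. */
static RelFileNode
assign_relfilenode(RelFileNode candidate)
{
    while (in_drop_list(candidate))
        candidate++;                /* or force a clean-out instead */
    return candidate;
}
```

The point of the sketch is the trade-off Robert raises next: every buffer-cleaning path now pays an `in_drop_list()` check, in exchange for making the drop itself O(1) in the common case.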
> It's not clear to me whether it would be worth the overhead of doing
> something like this. Making relation drops faster at the cost of
> making buffer cleaning slower could be a loser.

Perhaps you could make it a bit more incremental, avoiding the full clean-out scans (your points 2 and 3) while still allowing the background writer to bear the brunt in the best case? Something like this: the zombie relfilenode tracker could be fixed-size and self-cleaning. Entries could be dropped when the CLOCK hand completes a full rotation since the entry was created, and if the tracker is full when you're trying to insert a new entry you can scan the buffer pool until you've caused one entry (the oldest, by clock-hand-at-time-of-insertion) to be removed to make space (so the worst case is a full pass, but hopefully we can drop one entry before then).

The main argument I can think of against that idea is that it is tied to the current buffer eviction strategy, using its buffer-index-based clock hand as a horizon. Extracting a useful horizon for algorithms like ARC or CAR would probably need more thought, because their 'hands' jump around following next pointers.

Anyway, it's a lot of complexity, it falls back to worst cases like today's, and it can also transfer work to innocent foreground processes. I see why Andres says we should just get a better data structure so we can make the guy doing the dropping pay for it up front, but more efficiently. I suspect we may also want an ordered data structure for write clustering purposes one day.

--
Thomas Munro
http://www.enterprisedb.com