Thread: Recovery performance of standby for multiple concurrent truncates on large tables

Hello hackers,

 

Recently, the poor performance of multiple drop/truncate tables in a single transaction with large shared_buffers (as shown below) was addressed by commit b416691.

              BEGIN;

              truncate tbl001;

              ...

              truncate tbl050;

              COMMIT;

 

However, we have a customer that needs to execute multiple concurrent TRUNCATEs (40~50) on different large tables (as shown below) in a short amount of time. This one is not covered by the previous commit's improvement.

              BEGIN;

              truncate tbl001;

              COMMIT;

              ...

              BEGIN;

              truncate tbl050;

              COMMIT;


[Problem]

Currently, when the standby replays the WAL for TRUNCATE/DROP TABLE, it scans the whole shared buffer once per relation to check whether or not the table being deleted is cached in the shared buffer. Moreover, if shared_buffers is large (e.g., 300GB) and the primary server fails during the replay, it takes a long while for the standby to complete recovery.

 

[Idea]

In the current implementation, the replay of each TRUNCATE/DROP TABLE scans the whole shared buffer.

One approach (though the idea is not fully developed yet) is to improve recovery by delaying the shared buffer scan and invalidation (DropRelFileNodeBuffers) until after the next checkpoint (after failover completion). The replay of TRUNCATE/DROP TABLE would just make the checkpointer process remember which relations should be invalidated in the shared buffer during the subsequent checkpoint. The checkpointer then scans the shared buffer only once to invalidate the buffers of relations that were dropped or truncated.
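To make the idea concrete, below is a minimal standalone sketch (toy C, not PostgreSQL code; buffer_pool, remember_dropped_rel, invalidate_pending and the fixed-size pending array are all hypothetical stand-ins). In a real implementation the pending list would live in shared memory and would need a fallback when it overflows:

#include <stdbool.h>
#include <stdio.h>

#define NBUFFERS    8   /* stand-in for shared_buffers */
#define MAX_PENDING 4   /* fixed-size pending-invalidation list */

typedef struct BufferDesc
{
    unsigned    rel;    /* relfilenode the cached page belongs to */
    bool        valid;
} BufferDesc;

static BufferDesc buffer_pool[NBUFFERS];
static unsigned pending[MAX_PENDING];
static int  npending;

/* Replay of TRUNCATE/DROP TABLE: O(1), no buffer scan here. */
static bool
remember_dropped_rel(unsigned rel)
{
    if (npending == MAX_PENDING)
        return false;       /* caller must fall back to an immediate scan */
    pending[npending++] = rel;
    return true;
}

/* Checkpointer: one pass over the pool covers all pending drops. */
static void
invalidate_pending(void)
{
    for (int i = 0; i < NBUFFERS; i++)
        for (int j = 0; j < npending; j++)
            if (buffer_pool[i].valid && buffer_pool[i].rel == pending[j])
                buffer_pool[i].valid = false;   /* drop; never write back */
    npending = 0;
}

int
main(void)
{
    for (int i = 0; i < NBUFFERS; i++)
        buffer_pool[i] = (BufferDesc) {.rel = (unsigned) (i % 3), .valid = true};

    remember_dropped_rel(1);    /* replay "TRUNCATE tbl001" */
    remember_dropped_rel(2);    /* replay "TRUNCATE tbl002" */
    invalidate_pending();       /* one scan instead of two */

    for (int i = 0; i < NBUFFERS; i++)
        printf("buf %d rel %u %s\n", i, buffer_pool[i].rel,
               buffer_pool[i].valid ? "valid" : "invalid");
    return 0;
}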

 

However, this is still a rough idea, so I am not sure if it’s feasible. I would like to know if the community has advice or other alternative solutions on how to work around this.

Any insights, advice, feedback?

Thank you in advance.

 

 

Regards,

Kirk Jamison

Hi,

On 2018-07-10 07:05:12 +0000, Jamison, Kirk wrote:
> Hello hackers,
> 
> Recently, the poor performance of multiple drop/truncate tables in a
> single transaction with large shared_buffers (as shown below) was
> addressed by commit b416691.
>               BEGIN;
>               truncate tbl001;
>               ...
>               truncate tbl050;
>               COMMIT;
> 
> However, we have a customer that needs to execute multiple concurrent
> TRUNCATEs (40~50) on different large tables (as shown below) in a short
> amount of time. This one is not covered by the previous commit's
> improvement.
>               BEGIN;
>               truncate tbl001;
>               COMMIT;
>               ...
>               BEGIN;
>               truncate tbl050;
>               COMMIT;
> 
> [Problem]
> Currently, when the standby replays the WAL for TRUNCATE/DROP TABLE, it
> scans the whole shared buffer once per relation to check whether or not
> the table being deleted is cached in the shared buffer. Moreover, if
> shared_buffers is large (e.g., 300GB) and the primary server fails
> during the replay, it takes a long while for the standby to complete
> recovery.

If you do so sequentially on the primary, I'm not clear as to why you
think the issue is bigger in recovery?


> [Idea]
> In the current implementation, the replay of each TRUNCATE/DROP TABLE
> scans the whole shared buffer.
> One approach (though the idea is not fully developed yet) is to improve
> recovery by delaying the shared buffer scan and invalidation
> (DropRelFileNodeBuffers) until after the next checkpoint (after
> failover completion). The replay of TRUNCATE/DROP TABLE would just make
> the checkpointer process remember which relations should be invalidated
> in the shared buffer during the subsequent checkpoint. The checkpointer
> then scans the shared buffer only once to invalidate the buffers of
> relations that were dropped or truncated.

I think you'd run into a lot of very hairy details with this
approach. Consider what happens if client processes need fresh buffers
and need to write out a victim buffer. You'll need to know that the
relevant buffer is actually invalid. Thus the knowledge about the
"delayed" drops would need to be in shared buffers and scanned on every
dirty buffer writeout.


> However, this is still a rough idea, so I am not sure if it's feasible.
> I would like to know if the community has advice or other alternative
> solutions on how to work around this.
> Any insights, advice, feedback?

I personally think we should rather just work towards an ordered buffer
mapping implementation.

Greetings,

Andres Freund


On Tue, Jul 10, 2018 at 10:05 AM Jamison, Kirk <k.jamison@jp.fujitsu.com> wrote:

In the current implementation, the replay of each TRUNCATE/DROP TABLE scans the whole shared buffer.

One approach (though the idea is not fully developed yet) is to improve recovery by delaying the shared buffer scan and invalidation (DropRelFileNodeBuffers) until after the next checkpoint (after failover completion). The replay of TRUNCATE/DROP TABLE would just make the checkpointer process remember which relations should be invalidated in the shared buffer during the subsequent checkpoint. The checkpointer then scans the shared buffer only once to invalidate the buffers of relations that were dropped or truncated.


How about using the background writer for this? It seems to me that the main reason to invalidate buffers would be to free them up for buffer allocation, which is precisely the task of the background writer. When adding a filenode to be invalidated, take note of the bgwriter position and add it to a queue. As the bgwriter advances, it checks each buffer tag against a hash table of filenodes being invalidated. When the background writer has completed a loop, it can remove the invalidated filenode. When the bgwriter falls behind the clock sweep and there are filenodes to invalidate, it should run the invalidation scan instead of skipping ahead. If there are already too many filenodes being invalidated, then whoever is trying to add a new one gets to run the invalidation scan until something can be evicted.
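Something along these lines, as a standalone toy sketch (hypothetical names throughout, and a plain array standing in for the hash table; this is not PostgreSQL's actual bgwriter code):

#include <stdbool.h>
#include <stdio.h>

#define NBUFFERS 8

typedef struct BufferDesc
{
    unsigned    rel;
    bool        valid;
    bool        dirty;
} BufferDesc;

static BufferDesc buffer_pool[NBUFFERS];
static unsigned invalidating[4];    /* stand-in for the filenode hash table */
static int  ninvalidating;
static int  clock_hand;             /* bgwriter's sweep position */

static bool
being_invalidated(unsigned rel)
{
    for (int j = 0; j < ninvalidating; j++)
        if (invalidating[j] == rel)
            return true;
    return false;
}

/* One bgwriter step: check the tag before cleaning the buffer. */
static void
bgwriter_step(void)
{
    BufferDesc *buf = &buffer_pool[clock_hand];

    if (buf->valid && being_invalidated(buf->rel))
        buf->valid = buf->dirty = false;    /* free it; skip the write-out */
    else if (buf->dirty)
        buf->dirty = false;                 /* ... write the page out here ... */

    clock_hand = (clock_hand + 1) % NBUFFERS;
    /* A real version would note the hand position when a filenode was
     * added and remove the entry once the hand completes that loop. */
}

int
main(void)
{
    for (int i = 0; i < NBUFFERS; i++)
        buffer_pool[i] = (BufferDesc) {(unsigned) (i % 2), true, true};
    invalidating[ninvalidating++] = 1;  /* rel 1 was just truncated */

    for (int i = 0; i < NBUFFERS; i++)  /* one full loop of the pool */
        bgwriter_step();
    for (int i = 0; i < NBUFFERS; i++)
        printf("buf %d rel %u %s\n", i, buffer_pool[i].rel,
               buffer_pool[i].valid ? "valid" : "freed");
    return 0;
}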

--
Ants Aasma
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com/

Hi,

Thank you for your replies.

 

On Tue, July 10, 2018 4:15 PM, Andres Freund wrote:

>I think you'd run into a lot of very hairy details with this approach. Consider what happens if client processes need fresh buffers and need to write out a victim buffer. You'll need to know that the relevant buffer is actually invalid. Thus the knowledge about the "delayed" drops would need to be in shared buffers and scanned on every dirty buffer writeout.

 

Yes, you're right. There are problems associated with checkpoints, as you pointed out. I just thought of involving the processes that write dirty pages to disk, which is a function of both the checkpointer and the background writer.

> How about using the background writer for this? ...

> ...

And now that I think about it, Ants's suggestion above of using the background writer seems to work better, as the bgwriter is more active and continuously writes dirty pages, whereas the checkpointer only does so at intervals. This seems to be a more viable solution and I'd appreciate advice. I'll explore this idea, and if possible, I'd like to develop a solution for the next commitfest.

 

> I personally think we should rather just work towards an ordered buffer mapping implementation.

 

I understand that the main problem lies in the shared_buffers scanning, and that a buffer mapping implementation is underway(?) for PG12. I am not sure if the community has arrived at a consensus for that long-term fix. Would the community also welcome minor/smaller-scale improvements (just for this particular problem of WAL recovery for TRUNCATE TABLE)? If yes, then I'd like to work on a possible solution.

 

Regards,

Kirk Jamison

On 2018-07-19 00:53:14 +0000, Jamison, Kirk wrote:
> Hi,
> Thank you for your replies.
> 
> On Tue, July 10, 2018 4:15 PM, Andres Freund wrote:
> > I think you'd run into a lot of very hairy details with this
> > approach. Consider what happens if client processes need fresh
> > buffers and need to write out a victim buffer. You'll need to know
> > that the relevant buffer is actually invalid. Thus the knowledge
> > about the "delayed" drops would need to be in shared buffers and
> > scanned on every dirty buffer writeout.
> 
> Yes, you're right. There are problems associated with checkpoints, as
> you pointed out. I just thought of involving the processes that write
> dirty pages to disk, which is a function of both the checkpointer and
> the background writer.
> > How about using the background writer for this? ...
> > ...
> And now that I think about it, Ants's suggestion above of using the
> background writer seems to work better, as the bgwriter is more active
> and continuously writes dirty pages, whereas the checkpointer only does
> so at intervals. This seems to be a more viable solution and I'd
> appreciate advice. I'll explore this idea, and if possible, I'd like to
> develop a solution for the next commitfest.

It'd have very similar problems as I pointed out above.


> > I personally think we should rather just work towards an ordered buffer mapping implementation.
> 
> I understand that the main problem lies in the shared_buffers scanning,
> and that a buffer mapping implementation is underway(?) for PG12. I am
> not sure if the community has arrived at a consensus for that long-term
> fix. Would the community also welcome minor/smaller-scale improvements
> (just for this particular problem of WAL recovery for TRUNCATE TABLE)?
> If yes, then I'd like to work on a possible solution.

I think this is a case where the potential work arounds are complex
enough to use significant resources to get right, and are likely to make
properly fixing the issue harder. I'm willing to comment on proposals
that claim not to be problematic in those regards, but I have *SERIOUS*
doubts they exist.

Greetings,

Andres Freund


Hi Andres,

> I think this is a case where the potential work arounds are complex
> enough to use significant resources to get right, and are likely to
> make properly fixing the issue harder. I'm willing to comment on
> proposals that claim not to be problematic in those regards, but I have
> *SERIOUS* doubts they exist.

Alright. But I'd still like to ask your thoughts about it (below).
The proposed design touches the buffer invalidation during the recovery process of the standby server.

The question was about "how" to remember the buffers that belong to truncated/dropped tables, right?

1. Because the multiple scans of the whole shared buffer per concurrent
truncate/drop table were the cause of the time-consuming behavior, DURING
the failover process while WAL is being applied, we temporarily delay the
scanning and invalidating of shared buffers. At the same time, we remember
the relations/relfilenodes (of dropped/truncated tables) by adding them to
a hash table called a "skip list".

2. After WAL is applied, the checkpointer (or bg writer) scans the shared
buffer only ONCE, compares the pages against the skip list, and invalidates
the relevant pages. After the relevant pages are deleted from shared
memory, they will not be written back to disk.

Assuming the theory works, this design will only affect the behavior of the
checkpointer (or maybe the bg writer) during the recovery process /
failover. Any feedback, thoughts?

BTW, are there any updates on whether the community will push through
anytime soon with the buffer mapping implementation you mentioned?


Regards,
Kirk Jamison



On Mon, Jul 30, 2018 at 1:22 AM, Jamison, Kirk <k.jamison@jp.fujitsu.com> wrote:
> 1. Because the multiple scans of the whole shared buffer per concurrent
> truncate/drop table were the cause of the time-consuming behavior,
> DURING the failover process while WAL is being applied, we temporarily
> delay the scanning and invalidating of shared buffers. At the same
> time, we remember the relations/relfilenodes (of dropped/truncated
> tables) by adding them to a hash table called a "skip list".
> 2. After WAL is applied, the checkpointer (or bg writer) scans the
> shared buffer only ONCE, compares the pages against the skip list, and
> invalidates the relevant pages. After the relevant pages are deleted
> from shared memory, they will not be written back to disk.
>
> Assuming the theory works, this design will only affect the behavior of
> the checkpointer (or maybe the bg writer) during the recovery process /
> failover. Any feedback, thoughts?

How would this work if a relfilenode number that belonged to an old
relation got recycled for a new relation?

I think something like this could be made to work -- both on the
master and the standby, and not just while waiting for a failover --
if we did something like this:

(1) Limit the number of deferred drops to a reasonably small number
(one cache line?  1kB?).
(2) Background writer regularly does scans to clean out everything
from the deferred-drop list.
(3) If a foreground process finds the deferred-drop list is full when
it needs to add to it, it forces a clean-out of the list contents +
whatever new stuff it has in a single pass.
(4) If we are about to generate a relfilenode that's in the list, we
either force a clean out or generate a different relfilenode instead.
(5) Every buffer cleaning operation checks the deferred-drop list and
invalidates without write-out if the buffer is found.

It's not clear to me whether it would be worth the overhead of doing
something like this.  Making relation drops faster at the cost of
making buffer cleaning slower could be a loser.
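For concreteness, a minimal standalone sketch of (1)-(5) above (toy C with hypothetical names; the real list and counters would of course live in shared memory and need locking):

#include <stdbool.h>
#include <stdio.h>

#define NBUFFERS  8
#define MAX_DROPS 2     /* (1) keep the list small and cheap to scan */

typedef struct BufferDesc
{
    unsigned    rel;
    bool        valid;
} BufferDesc;

static BufferDesc buffer_pool[NBUFFERS];
static unsigned dropped[MAX_DROPS];
static int  ndropped;

/* (5) every buffer-cleaning operation checks the deferred-drop list */
static bool
must_invalidate(unsigned rel)
{
    for (int j = 0; j < ndropped; j++)
        if (rel == dropped[j])
            return true;
    return false;
}

/* (2)/(3) one pass that clears every deferred drop at once */
static void
clean_out(void)
{
    for (int i = 0; i < NBUFFERS; i++)
        if (buffer_pool[i].valid && must_invalidate(buffer_pool[i].rel))
            buffer_pool[i].valid = false;   /* invalidate without write-out */
    ndropped = 0;
}

/* (3) foreground path: a full list forces a clean-out in a single pass */
static void
defer_drop(unsigned rel)
{
    if (ndropped == MAX_DROPS)
        clean_out();
    dropped[ndropped++] = rel;
}

/* (4) never hand out a relfilenode that is still pending invalidation */
static unsigned
assign_relfilenode(unsigned candidate)
{
    if (must_invalidate(candidate))
        clean_out();            /* or pick a different relfilenode */
    return candidate;
}

int
main(void)
{
    for (int i = 0; i < NBUFFERS; i++)
        buffer_pool[i] = (BufferDesc) {(unsigned) (i % 3), true};

    defer_drop(1);
    defer_drop(2);
    defer_drop(0);              /* list full: forces a clean-out first */

    unsigned    reused = assign_relfilenode(0);

    printf("relfilenode %u reused, %d drops pending\n", reused, ndropped);
    return 0;
}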

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Hi,

On 2018-07-30 16:01:53 -0400, Robert Haas wrote:
> (1) Limit the number of deferred drops to a reasonably small number
> (one cache line?  1kB?).

Yea, you'd have to, because we'd frequently need to check it, and it'd
need to be in shared memory. But that'd still leave us to regress to
O(n^2) as soon as a transaction goes over that limit - which I don't
think is that infrequent...

Greetings,

Andres Freund


Hi,

On 2018-07-30 05:22:48 +0000, Jamison, Kirk wrote:
> BTW, are there any updates whether the community will push through
> anytime soon regarding the buffer mapping implementation you
> mentioned?

I'm continuing to work on it, but unfortunately there's a couple
projects that have higher priority atm :(.  I'm doubtful I can have a
patchset in a committable shape for v12, but I'm pretty sure I'll have
it in a shape good enough to make progress towards v13.  Sorry :(

Greetings,

Andres Freund


On Tue, Jul 31, 2018 at 8:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jul 30, 2018 at 1:22 AM, Jamison, Kirk <k.jamison@jp.fujitsu.com> wrote:
>> 1. Because the multiple scans of the whole shared buffer per
>> concurrent truncate/drop table were the cause of the time-consuming
>> behavior, DURING the failover process while WAL is being applied, we
>> temporarily delay the scanning and invalidating of shared buffers. At
>> the same time, we remember the relations/relfilenodes (of
>> dropped/truncated tables) by adding them to a hash table called a
>> "skip list".
>> 2. After WAL is applied, the checkpointer (or bg writer) scans the
>> shared buffer only ONCE, compares the pages against the skip list, and
>> invalidates the relevant pages. After the relevant pages are deleted
>> from shared memory, they will not be written back to disk.
>>
>> Assuming the theory works, this design will only affect the behavior
>> of the checkpointer (or maybe the bg writer) during the recovery
>> process / failover. Any feedback, thoughts?
>
> How would this work if a relfilenode number that belonged to an old
> relation got recycled for a new relation?
>
> I think something like this could be made to work -- both on the
> master and the standby, and not just while waiting for a failover --
> if we did something like this:
>
> (1) Limit the number of deferred drops to a reasonably small number
> (one cache line?  1kB?).
> (2) Background writer regularly does scans to clean out everything
> from the deferred-drop list.
> (3) If a foreground process finds the deferred-drop list is full when
> it needs to add to it, it forces a clean-out of the list contents +
> whatever new stuff it has in a single pass.
> (4) If we are about to generate a relfilenode that's in the list, we
> either force a clean out or generate a different relfilenode instead.
> (5) Every buffer cleaning operation checks the deferred-drop list and
> invalidates without write-out if the buffer is found.
>
> It's not clear to me whether it would be worth the overhead of doing
> something like this.  Making relation drops faster at the cost of
> making buffer cleaning slower could be a loser.

Perhaps you could make it a bit more incremental, avoiding the full
clean-out scans (your points 2 and 3) while still allowing the
background writer to bear the brunt in the best case?  Something like
this:

The zombie relfilenode tracker could be fixed-size and self-cleaning:
entries could be dropped when the CLOCK hand completes a full rotation
since the entry was created, and if it's full when you're trying to
insert a new entry you can scan the buffer pool until you've caused
one entry (the oldest, by clock hand-at-time-of-insertion) to be
removed to make space (so the worst case is a full pass, but hopefully
we can drop one entry before then).  The main argument I can think of
against that idea is that it is tied to the current buffer eviction
strategy, using its buffer index-based clock hand as a horizon.
Extracting a useful horizon for algorithms like ARC or CAR would
probably need more thought, because their 'hands' jump around
following next pointers.
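For what it's worth, a compact standalone sketch of such a self-cleaning tracker, under the same toy-model caveats as the earlier sketches (hypothetical names, a plain array instead of a shared-memory structure, and the simple buffer-index clock hand as the horizon):

#include <stdbool.h>
#include <stdio.h>

#define NBUFFERS   8
#define MAX_ZOMBIE 2

typedef struct BufferDesc
{
    unsigned    rel;
    bool        valid;
} BufferDesc;

typedef struct Zombie
{
    unsigned    rel;
    int         hand_at_insert; /* clock position when the entry was made */
} Zombie;

static BufferDesc buffer_pool[NBUFFERS];
static Zombie zombies[MAX_ZOMBIE];  /* oldest entry is always zombies[0] */
static int  nzombies;
static int  clock_hand;

/* Invalidate the buffer under the hand if its relation is tracked,
 * advance the hand, and retire the oldest entry once it is lapped. */
static void
sweep_one(void)
{
    for (int j = 0; j < nzombies; j++)
        if (buffer_pool[clock_hand].rel == zombies[j].rel)
            buffer_pool[clock_hand].valid = false;
    clock_hand = (clock_hand + 1) % NBUFFERS;

    if (nzombies > 0 && clock_hand == zombies[0].hand_at_insert)
    {
        /* full rotation completed: every buffer has been visited */
        for (int j = 1; j < nzombies; j++)
            zombies[j - 1] = zombies[j];
        nzombies--;
    }
}

/* Insert; if full, sweep only until the oldest entry retires
 * (worst case one full pass, hopefully much less). */
static void
track_zombie(unsigned rel)
{
    while (nzombies == MAX_ZOMBIE)
        sweep_one();
    zombies[nzombies] = (Zombie) {rel, clock_hand};
    nzombies++;
}

int
main(void)
{
    for (int i = 0; i < NBUFFERS; i++)
        buffer_pool[i] = (BufferDesc) {(unsigned) i, true};

    track_zombie(1);
    track_zombie(2);
    track_zombie(3);            /* tracker full: forces a partial sweep */
    printf("%d entries tracked, hand at %d\n", nzombies, clock_hand);
    return 0;
}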

Anyway, it's a lot of complexity, and it falls back to worst cases
like today's, and can also transfer work to innocent foreground
processes.  I see why Andres says we should just get a better data
structure so we can make the guy doing the dropping pay for it up
front, but more efficiently.  I suspect we may also want an ordered
data structure for write clustering purposes one day.

--
Thomas Munro
http://www.enterprisedb.com


RE: Recovery performance of standby for multiple concurrent truncates on large tables

From: "Tsunakawa, Takayuki"
From: Robert Haas [mailto:robertmhaas@gmail.com]
> It's not clear to me whether it would be worth the overhead of doing
> something like this.

Quite frankly, not really to me, either.

> Making relation drops faster at the cost of
> making buffer cleaning slower could be a loser.

The purpose is not making relation drops faster (on the primary), but
keeping failover time within 10 seconds.  I don't really know how crucial
that requirement is, but I'm feeling it would be good for PostgreSQL to be
able to guarantee a shorter failover time.


Regards
Takayuki Tsunakawa




RE: Recovery performance of standby for multiple concurrent truncates on large tables

From: "Tsunakawa, Takayuki"
From: 'Andres Freund' [mailto:andres@anarazel.de]
> I'm continuing to work on it, but unfortunately there's a couple
> projects that have higher priority atm :(.  I'm doubtful I can have a
> patchset in a committable shape for v12, but I'm pretty sure I'll have
> it in a shape good enough to make progress towards v13.  Sorry :(

We'd like to do it for PG 12, if Kirk's current proposal doesn't find a
good landing point.  Could you tell us about how many lines of code you
predict, and what part would be difficult?  Is there any software you're
referring to (e.g. Linux's fsync, MySQL, etc.)?

Regards
Takayuki Tsunakawa





Hi, 
I appreciate the feedback and suggestions.

On Tue, Jul 31, 2018 at 8:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> How would this work if a relfilenode number that belonged to an old 
>> relation got recycled for a new relation?
>> ..
>> I think something like this could be made to work -- both on the 
>> master and the standby, and not just while waiting for a failover --
>> ...
>> It's not clear to me whether it would be worth the overhead of doing 
>> something like this.  Making relation drops faster at the cost of 
>> making buffer cleaning slower could be a loser.

The deferred list has information such as the relfilenode and the head/top
page number of the invalid pages. The standby in promote mode only registers
info in the deferred list when redoing the COMMIT record. However, it scans
the shared buffer cache only once before the end of the promote, and
discards the pages related to the tables included in the deferred list.
After removing a page, it increments the head page number of the invalid
pages mentioned above. In this way, if an INSERT somehow progresses while
promoting (I'm not sure if that's possible), the discard progresses too,
and the app can refer to it.

As mentioned by Tsunakawa-san above, the purpose of the design is to make
the failover process as short as possible, not really to make the drop of
relations faster. My initial thought was not to touch the regular buffer
invalidation process, but I am also not sure of the level of complexity
that the proposed design would introduce, or how crucial the requirement
(shorter failover) is, so I asked for the community's feedback.


On Tue, July 31, 2018 8:27 AM, Thomas Munro wrote:
> Anyway, it's a lot of complexity, and it falls back to a worst cases 
> like today, and can also transfer work to innocent foreground processes.
> I see why Andres says we should just get a better data structure so we 
> can make the guy doing the dropping pay for it up front, but more 
> efficiently.  I suspect we may also want an ordered data structure for 
> write clustering purposes one day.

I also understand the value of solving the root cause of the problem; that's
why I asked Andres whether we could expect development from the community
for v12 regarding the radix tree approach for buffer management, or even
just a WIP patch from anyone, whether it uses a radix tree or not.
And perhaps we'd like to get involved as well, as this is also our
customer's problem.

So I am just curious about the radix tree approach's design. Maybe we should
start discussing what kind of data structures, processing, etc. are involved?

I also read other design solutions from another thread [1].
a. Fujii-san's proposal
Add $SUBJECT for performance improvement; reloption to prevent vacuum 
from truncating empty pages
b. Pavan's comment
> What if we remember the buffers as seen by count_nondeletable_pages() and
> then just discard those specific buffers instead of scanning the entire
> shared_buffers again? Surely we revisit all to-be-truncated blocks before
> actual truncation. So we already know which buffers to discard. And we're
> holding exclusive lock at that point, so nothing can change underneath. Of
> course, we can't really remember a large number of buffers, so we can do
> this in small chunks. Scan last K blocks, remember those K buffers, discard
> those K buffers, truncate the relation and then try for next K blocks. If
> another backend requests lock on the table, we give up or retry after a
> while.

[1]
https://www.postgresql.org/message-id/flat/20180419203802.hqrb4o2wexjnb2ft%40alvherre.pgsql#7d4a8c56a01392a3909b2150371e6495
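To illustrate Pavan's idea (b) more concretely, here is a rough standalone sketch (hypothetical stand-ins find_buffer/discard_buffer/truncate_rel replace the real bufmgr/smgr calls, and the locking is omitted):

#include <stdio.h>

#define K 4     /* chunk size: how many tail blocks to handle at a time */

/* hypothetical stand-ins for the real buffer-manager/storage calls */
static int
find_buffer(unsigned rel, unsigned blkno)
{
    (void) rel;             /* unused in this toy */
    return (int) blkno;     /* pretend blkno maps straight to a buffer */
}

static void
discard_buffer(int buf)
{
    printf("discard buffer %d\n", buf);
}

static void
truncate_rel(unsigned rel, unsigned nblocks)
{
    printf("rel %u truncated to %u blocks\n", rel, nblocks);
}

/* Scan the last K blocks, remember those buffers, discard exactly
 * them (no shared_buffers scan), truncate that chunk, and repeat. */
static void
chunked_truncate(unsigned rel, unsigned old_nblocks, unsigned new_nblocks)
{
    unsigned    end = old_nblocks;

    while (end > new_nblocks)
    {
        unsigned    start = (end > new_nblocks + K) ? end - K : new_nblocks;
        int         remembered[K];
        int         n = 0;

        for (unsigned blk = start; blk < end; blk++)
            remembered[n++] = find_buffer(rel, blk);    /* remember ... */

        for (int i = 0; i < n; i++)
            discard_buffer(remembered[i]);              /* ... discard ... */

        truncate_rel(rel, start);                       /* ... truncate */
        end = start;

        /* A real version would check here whether another backend
         * wants the lock, and give up or retry after a while. */
    }
}

int
main(void)
{
    chunked_truncate(1, 10, 0);
    return 0;
}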

Now, how do we move forward?


Thank you everyone.

Regards,
Kirk Jamison