Thread: [PATCH] lazy relations delete

[PATCH] lazy relations delete

From: Maxim Orlov
Date: Tue, 31 Dec 2019 13:16:49 +0300
Hi!

Here is the case.

Assume we have master-to-slave replication with shared_buffers set to 2 GB 
on the master and 4 GB on the slave. All data is written on the master, 
while all reads happen on the slave.

Now suppose we decide to drop many tables, say 1000 or 10000, not in a 
single transaction but each in its own transaction. Because shared_buffers 
is a plain, unindexed array, we have to run a full loop over every buffer 
for each dropped relation, which leads to lag between master and slave.
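
For reference, the loop in question looks roughly like this. This is a 
simplified sketch modelled on DropRelFileNodeBuffers() in bufmgr.c, with 
fork numbers, local buffers and other details omitted:

    /* invalidate all shared buffers that belong to one dropped relation */
    static void
    drop_rel_buffers_sketch(RelFileNode rnode)
    {
        for (int i = 0; i < NBuffers; i++)
        {
            BufferDesc *bufHdr = GetBufferDescriptor(i);
            uint32      buf_state;

            /* cheap unlocked pre-check, as the real code does */
            if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
                continue;

            buf_state = LockBufHdr(bufHdr);
            if (RelFileNodeEquals(bufHdr->tag.rnode, rnode))
                InvalidateBuffer(bufHdr);   /* releases the header spinlock */
            else
                UnlockBufHdr(bufHdr, buf_state);
        }
    }

Every DROP pays this full O(NBuffers) scan no matter how small the relation 
is, which is why the lag grows with shared_buffers.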

In a real-world scenario this leads not to minutes of lag but to hours. At 
the same time, PostgreSQL already has an efficient routine for deleting 
many relations within a single transaction.

So, to get rid of this issue, an idea came up: what if we do not invalidate 
every dropped relation's buffers right away, but instead store the dropped 
relfilenodes in an array and prevent the shared buffers that correspond to 
the deleted relations from being flushed? Then, once the array reaches its 
maximum size, we walk all buffers only once to "free" the shared buffers 
corresponding to the deleted relations.
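
In code, the idea looks roughly like the sketch below. The names here are 
hypothetical (the real patch is in the attachment); FlushBuffer() would 
additionally have to skip any buffer whose relfilenode is parked in the 
array, which is the "prevent from being flushed" part:

    #define LAZY_DELETE_ARRAY_SIZE 128          /* assumed batch size */

    static RelFileNode lazy_deleted[LAZY_DELETE_ARRAY_SIZE];
    static int         n_lazy_deleted = 0;

    static bool
    is_lazily_deleted(const RelFileNode *rnode)
    {
        for (int i = 0; i < n_lazy_deleted; i++)
            if (RelFileNodeEquals(lazy_deleted[i], *rnode))
                return true;
        return false;
    }

    void
    DropRelFileNodeBuffersLazy(RelFileNode rnode)
    {
        lazy_deleted[n_lazy_deleted++] = rnode;
        if (n_lazy_deleted < LAZY_DELETE_ARRAY_SIZE)
            return;                             /* defer the expensive scan */

        /* one pass over the pool frees the buffers of the whole batch */
        for (int i = 0; i < NBuffers; i++)
        {
            BufferDesc *bufHdr = GetBufferDescriptor(i);
            uint32      buf_state;

            if (!is_lazily_deleted(&bufHdr->tag.rnode))
                continue;                       /* unlocked pre-check */

            buf_state = LockBufHdr(bufHdr);
            if (is_lazily_deleted(&bufHdr->tag.rnode))
                InvalidateBuffer(bufHdr);       /* releases the spinlock */
            else
                UnlockBufHdr(bufHdr, buf_state);
        }
        n_lazy_deleted = 0;
    }

With this, dropping 1000 tables costs one full scan per 
LAZY_DELETE_ARRAY_SIZE relations instead of 1000 scans.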

Here are some numbers from the tests I ran (time to drop 1000 tables, each 
in its own transaction):

  shared_buffers      without patch   with patch
  master  2 GB              6 sec        2 sec
  slave   4 GB              8 sec        2 sec
  master  4 GB             10 sec        3 sec
  slave   8 GB             16 sec        3 sec
  master 10 GB             22 sec        4 sec
  slave  20 GB             38 sec        4 sec

-- 
Max Orlov
E-mail: m.orlov@postgrespro.ru


Attachment

Re: [PATCH] lazy relations delete

From: Kyotaro Horiguchi
Date: Wed, 8 Jan 2020
Hello.

At Tue, 31 Dec 2019 13:16:49 +0300, Maxim Orlov <m.orlov@postgrespro.ru> wrote in 
> Now suppose we decide to drop many tables, say 1000 or 10000, not in a
> single transaction but each in its own transaction. Because
> shared_buffers is a plain, unindexed array, we have to run a full loop
> over every buffer for each dropped relation, which leads to lag between
> master and slave.
> 
> In a real-world scenario this leads not to minutes of lag but to hours.
> At the same time, PostgreSQL already has an efficient routine for
> deleting many relations within a single transaction.
> 
> So, to get rid of this issue, an idea came up: what if we do not
> invalidate every dropped relation's buffers right away, but instead
> store the dropped relfilenodes in an array and prevent the shared
> buffers that correspond to the deleted relations from being flushed?
> Then, once the array reaches its maximum size, we walk all buffers only
> once to "free" the shared buffers corresponding to the deleted
> relations.

That is a great performance gain, but the proposal seems to lead to
database corruption. We must avoid such cases.

A relfilenode can be reused right after commit. There can be a case
where readers of the reused relfilenode see pages from the
already-removed files still left in shared buffers. On the other hand,
newly allocated buffers for the reused relfilenode are not flushed out
until the lazy-invalidation machinery actually frees the "garbage"
buffers, which leads to a broken database after a crash. And when the
machinery finally does run, it trashes all buffers of that relfilenode,
the correct ones included.

As for performance, the hash lookup on every FlushBuffer call could be
a cost paid by unrelated transactions. And it leaves garbage buffers
sitting dead until more than LAZY_DELETE_ARRAY_SIZE relfilenodes have
been removed.
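
Concretely, the pattern I have in mind is something like this in the
write-out path (a hypothetical sketch; lazy_deleted_hash is an assumed
name, not necessarily what the patch does):

    static void
    FlushBuffer(BufferDesc *buf, SMgrRelation reln)
    {
        /* added check: never write out pages of lazily-deleted relations */
        if (hash_search(lazy_deleted_hash, &buf->tag.rnode,
                        HASH_FIND, NULL) != NULL)
            return;

        /* ... existing write-out path unchanged ... */
    }

Every buffer written out by any backend now pays that extra lookup.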


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: [PATCH] lazy relations delete

From: Thomas Munro
Date:
On Wed, Jan 8, 2020 at 5:20 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> A relfilenode can be reused right after commit. There can be a case
> where readers of the reused relfilenode see pages from the
> already-removed files still left in shared buffers. On the other hand,
> newly allocated buffers for the reused relfilenode are not flushed out
> until the lazy-invalidation machinery actually frees the "garbage"
> buffers, which leads to a broken database after a crash. And when the
> machinery finally does run, it trashes all buffers of that relfilenode,
> the correct ones included.

The relfilenode can't be reused until the next checkpoint, can it?
The truncated file remains in the file system, specifically to prevent
anyone from reusing the relfilenode.  See the comment for mdunlink().
There may be other problems with the idea, but wouldn't the zombie
buffers be harmless if they are invalidated before SyncPostCheckpoint()
unlinks the underlying files (and you never try to flush them)?
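
If so, tying the deferred invalidation to the checkpoint would give the
required ordering. A sketch of what I mean, with a hypothetical
InvalidateLazilyDeletedBuffers() and placement as in CreateCheckPoint():

    /* inside CreateCheckPoint(), after the buffer sync ... */

    /* drain the lazy-delete array while the old files still exist */
    InvalidateLazilyDeletedBuffers();   /* hypothetical */

    /*
     * Only now unlink the files queued by earlier mdunlink() calls,
     * allowing their relfilenodes to be reused.
     */
    SyncPostCheckpoint();

That keeps any zombie buffer from outliving the empty file that guards
its relfilenode against reuse.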