Thread: [PATCH] lazy relations delete
Hi! Here is the case. Assume we have master-to-slave replication with shared_buffers set to 2 GB on the master and 4 GB on the slave. All of the data is written on the master, while reading happens on the slave. Now suppose we drop many tables, say 1000 or 10000, not in a single transaction but each table in a separate one. Because shared_buffers is a "plain" array, we have to do a full buffer scan for every relation, which leads to lag between master and slave. In a real-world scenario this kind of issue led not to minutes of lag, but hours. At the same time, PostgreSQL already has an efficient routine for deleting many relations in a single transaction.

So, to get rid of this kind of issue, an idea came up: what if we do not invalidate the buffers of each dropped relation right away, but instead store the relations in an array and prevent the shared buffers belonging to the deleted relations from being flushed? Then, when the array reaches its maximum size, we only need to walk all buffers once to free the shared buffers belonging to the deleted relations.

Here are some numbers from the tests I made.

Without patch:
1. (master 2 GB) - drop 1000 tables took 6 sec
   (slave 4 GB) - drop 1000 tables took 8 sec
2. (master 4 GB) - drop 1000 tables took 10 sec
   (slave 8 GB) - drop 1000 tables took 16 sec
3. (master 10 GB) - drop 1000 tables took 22 sec
   (slave 20 GB) - drop 1000 tables took 38 sec

With patch:
1. (master 2 GB) - drop 1000 tables took 2 sec
   (slave 4 GB) - drop 1000 tables took 2 sec
2. (master 4 GB) - drop 1000 tables took 3 sec
   (slave 8 GB) - drop 1000 tables took 3 sec
3. (master 10 GB) - drop 1000 tables took 4 sec
   (slave 20 GB) - drop 1000 tables took 4 sec

--
Max Orlov
E-mail: m.orlov@postgrespro.ru
Hello.

At Tue, 31 Dec 2019 13:16:49 +0300, Maxim Orlov <m.orlov@postgrespro.ru> wrote in
> Now we decided to drop many tables, let's say 1000 or 10000 not in a
> single transaction, but each table in a separate one. So, due to
> "plain" shared_buffers memory we have to do for loop for every
> relation which leads to lag between master and slave.
>
> In real case scenario such issue lead to not a minutes lag, but hours
> lag. At the same time PostgreSQL have a great routine to delete many
> relations in a single transaction.
>
> So, to get rid of this kind of issue here came up an idea: what if not
> to delete everyone of relations right away and just store them in an
> array, prevent shared buffers (correspond to a deleted relations) from
> been flushed. And then array reaches it max size we need to walk all
> buffers only once to "free" shared buffers correspond to a deleted
> relations.

That is a great performance gain, but the proposal seems to lead to database corruption. We must avoid such cases.

A relfilenode can be reused right after commit. There can be a case where readers of the reused relfilenode see pages from the already-removed files left in shared buffers. On the other hand, newly allocated buffers for the reused relfilenode are not flushed out until the lazy-invalidation machinery actually frees the "garbage" buffers, and that leads to a broken database after a crash. Worse, when the machinery finally runs it trashes away the buffers, including the correct ones, at execution time.

As for performance, a hash lookup on every BufferFlush call could be a cost for unrelated transactions. And it leaves garbage buffers sitting dead until more than LAZY_DELETE_ARRAY_SIZE relfilenodes have been removed.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Jan 8, 2020 at 5:20 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> Relfilenode can be reused right after commit. There can be a case
> where readers of the reused relfilenode see the pages from already
> removed files left on shared buffers. On the other hand newly
> allocated buffers for the reused relfilenode are not flushed out until
> the lazy invalidate machinery actually frees the "garbage" buffers and
> it leads to a broken database after a crash. But finally the
> machinery trashes away the buffers involving the correct ones at
> execution time.

The relfilenode can't be reused until the next checkpoint, can it? The truncated file remains in the file system, specifically to prevent anyone from reusing the relfilenode. See the comment for mdunlink().

There may be other problems with the idea, but wouldn't the zombie buffers be harmless, if they are invalidated before SyncPostCheckpoint() unlinks the underlying files (and you never try to flush them)?